[jira] Commented: (LUCENE-868) Making Term Vectors more accessible

Yonik Seeley (JIRA) Mon, 09 Jul 2007 14:52:25 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511267 ]


Yonik Seeley commented on LUCENE-868:
-------------------------------------

I haven't really used the term vector APIs, but I like the goal of allowing the 
app to handle things.
What about dropping a level lower, and not constructing the arrays or 
TermVectorOffsetInfo either?
Perhaps something like:

public interface TermVectorMapper {
  void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions);
  void mapTerm(String term, int frequency);
  void mapTermPos(int startOffset, int endOffset, int position);
}

One could have an implementation of TermVectorMapper that collects the offsets 
and positions into arrays, as your patch does now.  I'm not sure whether there 
would be a noticeable performance impact from a method call per term instance.
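
To make that concrete, here is a rough sketch of such a collecting 
implementation.  It assumes the interface exactly as suggested above (a 
proposal in this comment, not an existing Lucene API) and assumes that each 
mapTerm() call is followed by `frequency` calls to mapTermPos().  The class 
name ArrayCollectingMapper and its fields are made up for illustration.

```java
// Callback interface as proposed in this comment (not an existing Lucene API).
interface TermVectorMapper {
  void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions);
  void mapTerm(String term, int frequency);
  void mapTermPos(int startOffset, int endOffset, int position);
}

// Hypothetical mapper that rebuilds per-term arrays from the callbacks.
// Assumption: mapTerm() is called once per term, then mapTermPos() is
// called `frequency` times for that term.
class ArrayCollectingMapper implements TermVectorMapper {
  String field;
  String[] terms;
  int[] freqs;
  int[][] positions;     // positions[i][j] = j-th position of term i, or null
  int[][] startOffsets;  // null when offsets are not stored
  int[][] endOffsets;
  private int termIdx = -1;  // index of the term currently being filled
  private int occIdx;        // next occurrence slot for the current term

  public void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions) {
    this.field = field;
    terms = new String[numTerms];
    freqs = new int[numTerms];
    positions = hasPositions ? new int[numTerms][] : null;
    startOffsets = hasOffsets ? new int[numTerms][] : null;
    endOffsets = hasOffsets ? new int[numTerms][] : null;
    termIdx = -1;
  }

  public void mapTerm(String term, int frequency) {
    termIdx++;
    terms[termIdx] = term;
    freqs[termIdx] = frequency;
    if (positions != null) {
      positions[termIdx] = new int[frequency];
    }
    if (startOffsets != null) {
      startOffsets[termIdx] = new int[frequency];
      endOffsets[termIdx] = new int[frequency];
    }
    occIdx = 0;
  }

  public void mapTermPos(int startOffset, int endOffset, int position) {
    if (positions != null) {
      positions[termIdx][occIdx] = position;
    }
    if (startOffsets != null) {
      startOffsets[termIdx][occIdx] = startOffset;
      endOffsets[termIdx][occIdx] = endOffset;
    }
    occIdx++;
  }
}
```

A mapper that only needs frequencies could leave mapTermPos() empty and skip 
the array allocation entirely, which is the point of pushing the decision to 
the app.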

Oh, wait...  I just went and looked at the readTermVector() code, and positions 
and offsets aren't stored interleaved, so one would have to do a sequence of 
mapTermPos() followed by a sequence of mapTermOffset(), which makes less sense 
than what you have now.

Might also consider using an abstract class instead of an interface in case you 
want to make backward-compatible tweaks later.
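
A rough sketch of what I mean by the abstract class route.  The names 
AbstractTermVectorMapper, TermCountingMapper and the finish() hook are all 
hypothetical, used only to show the mechanism: a method added later with an 
empty body doesn't break subclasses written before it existed, whereas adding 
a method to an interface would break every implementor.

```java
// Hypothetical abstract-class variant of the mapper suggested above.
abstract class AbstractTermVectorMapper {
  abstract void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions);
  abstract void mapTerm(String term, int frequency);
  abstract void mapTermPos(int startOffset, int endOffset, int position);

  // Hypothetical hook added in a later release.  The empty default body
  // means subclasses written against the earlier version still compile.
  void finish() {
  }
}

// A subclass written before finish() existed: it overrides only the
// original three methods and simply counts terms and occurrences.
class TermCountingMapper extends AbstractTermVectorMapper {
  int termCount;
  long totalFrequency;

  void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions) {
    termCount = 0;
    totalFrequency = 0;
  }

  void mapTerm(String term, int frequency) {
    termCount++;
    totalFrequency += frequency;
  }

  void mapTermPos(int startOffset, int endOffset, int position) {
    // Positions and offsets are not needed for counting.
  }
}
```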

> Making Term Vectors more accessible
> -----------------------------------
>
>                 Key: LUCENE-868
>                 URL: https://issues.apache.org/jira/browse/LUCENE-868
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-868-v1.patch
>
>
> One of the big issues with term vector usage is that the information is 
> loaded into parallel arrays as it is read, and those arrays are then often 
> manipulated again for use in the application (for instance, they are sorted 
> by frequency).
> Adding a callback mechanism that allows the vector loading to be handled by 
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, 
> TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The 
> existing getTermFreqVectors will be reimplemented to use an implementation of 
> TermVectorMapper that creates the parallel arrays.  Additionally, some simple 
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have 
> a patch soon.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
>  for related information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


