[ 
https://issues.apache.org/jira/browse/LUCENE-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926573#action_12926573
 ] 

Simon Willnauer commented on LUCENE-2700:
-----------------------------------------

Some of you might have followed the recent commit on the 
[branch|https://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/]; for 
those who didn't, I will sum up what has happened so far.
I integrated what is currently named "DocValues" (we might need to rename it to 
something like "PerDocValues" due to the naming conflict with function queries; I 
will wait for other suggestions though) into the 4-dimensional Flex API and in 
turn changed the FieldsConsumer and FieldsProducer interfaces to accept a new 
"DocValuesConsumer" / "DocValuesProducer" (implementing Fields), respectively. We 
have a default implementation for both of them, but neither is used by 
the "Term / Postings" codecs yet. I added a DocValuesCodec which wraps any 
other codec and forwards to it if a TermsConsumer / Producer is requested. The 
test case already uses a random codec wrapped by DocValuesCodec, so they are 
ultimately pluggable. 
DocValues are supported at the SegmentReader as well as the DirectoryReader level, 
i.e. they are integrated into MultiFields in the same way as Terms / DocsEnum 
etc. are.
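
To make the wrapping idea concrete, here is a minimal Java sketch of a consumer that handles per-document values itself and forwards every terms request to whatever consumer it wraps. All of the Sketch* types are hypothetical stand-ins invented for this illustration; they are not the interfaces on the branch, and the real signatures differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-field terms writer; not a Lucene type. */
interface SketchTermsConsumer {
    void addTerm(String term);
}

/** Hypothetical stand-in for a per-document values writer; not a Lucene type. */
interface SketchDocValuesConsumer {
    void add(int docId, long value);
}

/** Hypothetical stand-in for the consumer a codec hands out. */
interface SketchFieldsConsumer {
    SketchTermsConsumer addField(String field);
    SketchDocValuesConsumer addValuesField(String field);
}

/** A plain "terms only" consumer that knows nothing about per-document values. */
final class SketchTermsOnlyConsumer implements SketchFieldsConsumer {
    final List<String> log = new ArrayList<>();

    @Override
    public SketchTermsConsumer addField(String field) {
        // Record each term so the forwarding can be observed.
        return term -> log.add(field + ":" + term);
    }

    @Override
    public SketchDocValuesConsumer addValuesField(String field) {
        throw new UnsupportedOperationException("terms-only consumer");
    }
}

/** Wrapper: handles per-document values itself, forwards terms requests. */
final class SketchDocValuesWrapper implements SketchFieldsConsumer {
    private final SketchFieldsConsumer wrapped;
    final List<String> valuesLog = new ArrayList<>();

    SketchDocValuesWrapper(SketchFieldsConsumer wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public SketchTermsConsumer addField(String field) {
        // Terms and postings are none of our business: delegate to the wrapped consumer.
        return wrapped.addField(field);
    }

    @Override
    public SketchDocValuesConsumer addValuesField(String field) {
        // Per-document values are handled here, independent of the wrapped consumer.
        return (docId, value) -> valuesLog.add(field + "[" + docId + "]=" + value);
    }
}
```

Because the wrapper only depends on the stand-in interface, any terms-only consumer can be plugged in underneath it, which is the property the randomized test relies on.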


I ran into one rather big issue while integrating a "PerDoc" consumer / 
producer into Codec. When a codec instantiates a FieldsConsumer, most of the 
codecs already create all the "resources" necessary to consume terms and 
postings. This is problematic since PerDocConsumers are created way before the 
segment is flushed, while TermsConsumers are created / needed only before / 
during flush. So in the case of DocValues I pass the SegmentsWriteState into 
Codec#fieldsConsumer(..), and once the segment is flushed DocumentsWriter 
creates another one, which in turn fails since the files for this codec / 
consumer have already been created. For now, the solution I have implemented / 
hacked :) is to initialize the wrapped codec lazily with the 
SegmentsWriteState passed to Codec#fieldsConsumer(..) before the flush. This 
only works as long as nobody tries to get a TermsConsumer before we are ready 
to flush, which is kind of flaky. 
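
The failure mode and the lazy workaround can be sketched roughly as follows. This is only an illustration under simplifying assumptions: SketchDirectory, SketchEagerConsumer and SketchLazyConsumer are hypothetical names, and the "directory" is just a set of file names that refuses to create the same file twice.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical stand-in for a directory that refuses to create a file twice. */
final class SketchDirectory {
    private final Set<String> files = new HashSet<>();

    void createOutput(String name) {
        if (!files.add(name)) {
            throw new IllegalStateException("file already exists: " + name);
        }
    }

    int fileCount() {
        return files.size();
    }
}

/** Eager consumer: creates its files as soon as it is constructed. */
final class SketchEagerConsumer {
    SketchEagerConsumer(SketchDirectory dir, String segment) {
        dir.createOutput(segment + ".terms");
    }
}

/** Workaround: remember how to build the wrapped consumer and build it once, on demand. */
final class SketchLazyConsumer {
    private final Supplier<SketchEagerConsumer> factory;
    private SketchEagerConsumer delegate; // stays null until first requested

    SketchLazyConsumer(Supplier<SketchEagerConsumer> factory) {
        this.factory = factory;
    }

    SketchEagerConsumer wrapped() {
        if (delegate == null) {
            delegate = factory.get(); // files are created here, not in the constructor
        }
        return delegate;
    }
}
```

Constructing SketchEagerConsumer twice for the same segment throws, which mirrors the second instantiation at flush time. The lazy variant touches the directory only on the first wrapped() call, so it is only safe if nothing triggers that call before the flush.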

IMO we should not necessarily create all resources / files in the directory 
etc. when a FieldsConsumer is created, but move that one level down and do it 
once a TermsConsumer is requested. We are going to need these facilities anyway 
to integrate StoredFields etc., since they are per doc too. 

Comments welcome.

> Expose DocValues via Fields
> ---------------------------
>
>                 Key: LUCENE-2700
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2700
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: CSF branch
>
>
> DocValues readers are currently exposed / accessed directly via IndexReader. 
> To integrate the new feature in a more "native" way we should expose the 
> DocValues via Fields on a per-segment level and on MultiFields in the multi 
> reader case. DocValues should sit side by side with Fields.terms, enabling 
> access to Source, SortedSource and ValuesEnum, something like this:
> {code}
> public abstract class Fields {
> ...
>   public abstract DocValues values();
> }
> public abstract class DocValues {
>   /** on disk enum based API */
>   public abstract ValuesEnum getEnum() throws IOException;
>   /** in memory Random Access API - with enum support - first call loads values in ram */
>   public abstract Source getSource() throws IOException;
>   /** sorted in memory Random Access API - optional operation */
>   public SortedSource getSortedSource(Comparator<BytesRef> comparator)
>       throws IOException, UnsupportedOperationException {
>     // optional operation: implementations without sorted support keep this default
>     throw new UnsupportedOperationException();
>   }
>   /** unloads previously loaded source only but keeps the doc values open */
>   public abstract void unload();
>   /** closes the doc values */
>   public abstract void close();
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

