[
https://issues.apache.org/jira/browse/LUCENE-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926573#action_12926573
]
Simon Willnauer commented on LUCENE-2700:
-----------------------------------------
Some might have followed the recent commits on the
[branch|https://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/]; some
didn't, so I will sum up what has happened so far.
I integrated the currently named "DocValues" (we might need to rename it to
something like "PerDocValues" due to the naming conflict with func queries - I
will wait for other suggestions though) into the four-dimensional Flex API and
in turn changed the FieldsConsumer and FieldsProducer interfaces to accept a new
"DocValuesConsumer" / "DocValuesProducer" (implementing Fields) respectively. We
have a default implementation for both of them, while neither is used by
the "Term / Postings" codecs yet. I added a DocValuesCodec which wraps any
other codec and forwards to it whenever a TermsConsumer / Producer is
requested. The test case already uses a random codec wrapped by DocValuesCodec,
so they are ultimately pluggable.
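To make the wrapping idea concrete, here is a minimal sketch of the delegation pattern described above. Note that Codec, TermsConsumer and DocValuesConsumer here are hypothetical simplified stand-ins, not the real Lucene types; only the forwarding structure mirrors what DocValuesCodec does:

```java
// Simplified stand-ins for illustration only - NOT the real Lucene API.
interface TermsConsumer { void addTerm(String term); }
interface DocValuesConsumer { void addValue(int docId, long value); }

interface Codec {
    TermsConsumer termsConsumer();
}

// DocValuesCodec-style wrapper: forwards any terms / postings request to the
// wrapped codec and handles per-document values itself.
class DocValuesCodecSketch implements Codec {
    private final Codec delegate;

    DocValuesCodecSketch(Codec delegate) {
        this.delegate = delegate;
    }

    @Override
    public TermsConsumer termsConsumer() {
        // terms / postings pass through to the wrapped codec untouched
        return delegate.termsConsumer();
    }

    public DocValuesConsumer docValuesConsumer() {
        // per-doc values are handled by the wrapper itself
        return (docId, value) -> { /* write value for docId */ };
    }
}
```

Since the wrapper only delegates the terms side, any existing codec (including a randomly chosen one, as in the test case) can sit inside it unchanged.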
DocValues are supported at the SegmentReader as well as the DirectoryReader
level, i.e. they are integrated into MultiFields the same way as Terms /
DocsEnum etc.
I ran into one rather big issue while integrating a "PerDoc" consumer /
producer into Codec. When a codec instantiates a FieldsConsumer, most codecs
already create all the "resources" necessary to consume terms and postings.
That is problematic since PerDocConsumers are created way before the segment
is flushed, while a TermsConsumer is created / needed only right before /
during flush. So in the case of DocValues I pass the SegmentsWriteState into
Codec#fieldsConsumer(..), and once the segment is flushed DocumentsWriter
creates another one, which in turn fails since the files for this codec /
consumer have already been created. The solution I have implemented /
hacked :) is that I initialize the wrapped codec lazily with the
SegmentsWriteState passed to Codec#fieldsConsumer(..) before the flush. This
only works as long as nobody tries to get a TermsConsumer before we are ready
to flush, which is kind of flaky.
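The lazy-initialization workaround boils down to deferring resource creation from construction time to first use. A sketch of that pattern, with a hypothetical LazyFieldsConsumer and a Supplier standing in for the file-creating SegmentsWriteState machinery (none of these are the real Lucene types):

```java
// Sketch of deferred resource creation - simplified stand-in types only.
class LazyFieldsConsumer {
    // stands in for the file-creating machinery driven by SegmentsWriteState
    private final java.util.function.Supplier<String> resourceFactory;
    // created on first request, NOT at construction time
    private String resources;

    LazyFieldsConsumer(java.util.function.Supplier<String> resourceFactory) {
        this.resourceFactory = resourceFactory;
    }

    // Files in the directory come into existence only once a TermsConsumer
    // is actually requested, i.e. right before flush - and only once, so a
    // second request cannot trip over already-created files.
    String termsConsumer() {
        if (resources == null) {
            resources = resourceFactory.get();
        }
        return resources;
    }

    boolean initialized() {
        return resources != null;
    }
}
```

The flakiness noted above corresponds to the unguarded window before flush: nothing in this sketch prevents an early termsConsumer() call, it merely makes creation idempotent.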
IMO we should not necessarily create all resources / files in the directory
etc. when a FieldsConsumer is created, but move it one level down and do it
once a TermsConsumer is requested. We are going to need these facilities
anyway to integrate StoredFields etc., since they are per doc too.
Comments welcome.
> Expose DocValues via Fields
> ---------------------------
>
> Key: LUCENE-2700
> URL: https://issues.apache.org/jira/browse/LUCENE-2700
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Fix For: CSF branch
>
>
> DocValues readers are currently exposed / accessed directly via IndexReader.
> To integrate the new feature in a more "native" way we should expose the
> DocValues via Fields on a per-segment level and on MultiFields in the multi
> reader case. DocValues should live side by side with Fields.terms, enabling
> access to Source, SortedSource and ValuesEnum, something like this:
> {code}
> public abstract class Fields {
>   ...
>   public abstract DocValues values();
> }
>
> public abstract class DocValues {
>   /** on-disk enum based API */
>   public abstract ValuesEnum getEnum() throws IOException;
>
>   /** in-memory random access API - with enum support - first call loads
>       values into RAM */
>   public abstract Source getSource() throws IOException;
>
>   /** sorted in-memory random access API - optional operation */
>   public SortedSource getSortedSource(Comparator<BytesRef> comparator)
>       throws IOException, UnsupportedOperationException;
>
>   /** unloads a previously loaded source but keeps the doc values open */
>   public abstract void unload();
>
>   /** closes the doc values */
>   public abstract void close();
> }
> {code}