[
https://issues.apache.org/jira/browse/SOLR-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated SOLR-10255:
--------------------------------
Attachment: SOLR-10255.patch
I'm sure you do, Michael :-)
I temporarily hacked the default "large"-ness to be true (provided the field is
otherwise compatible with large -- stored & not multi-valued) and ran the Solr
core tests so that I could tease out unforeseen issues. Thankfully there were
only two failures:
* Some Luke test or other failed -- to be expected since it looks at the
underlying index.
* RealtimeGet wanted to reconstitute the document and get a real stored field
but didn't get one, and it skipped over the BinaryDocValues field. This'll
require some adjustment to the approach...
Here's a substantially improved patch.
* Moved the index time large field handling from TextField.createField (now
unmodified) to Solr's {{DocumentBuilder}}.
** Here I added an enum {{Mode}} to clarify the circumstances in which the
DocumentBuilder is used, which meant refactoring its API a little and adjusting
the call sites. I also eliminated a couple of convenience accessors like
{{AddUpdateCommand.getLuceneDocument}} that hid this important Mode intention.
I think this is a definite improvement: it's clear in what circumstances an
in-place update happens vs. realtime get etc., and it's now clear that we're
actually building this Document instead of "get"-ing one.
** Only in {{Mode.STANDARD_UPDATE}} does the large field handling occur.
** StrField ("string" field) works too; there's no explicit Solr FieldType
assumption, just that the field must be stored, must not be multi-valued, and
must produce a string rather than a number.
* To further streamline and reduce processing at retrieval time, the
{{DocumentBuilder}} large field processing inserts a special marker stored
field value that is seen by the {{SolrIndexSearcher.doc(...)}} methods.
** This allows us to easily _conditionally_ put large fields separately, e.g.
only values that are actually long, and maybe to support multi-valued fields by
handling the first occurrence. This is all a TODO.
* Refactored the {{SolrIndexSearcher.doc(id,fields)}} loading to always use a
new {{SolrDocumentStoredFieldVisitor}} inner class which handles both lazy
field loading and large doc handling (and neither). This seems simpler than
trying to handle an increasingly large number of combinations of cases (doc
cache, lazy fields, fields filter non-null, large fields) as separate paths.
Unfortunately the doc() method taking a visitor is an additional path that
can't easily be combined.
* When creating the BinaryDocValuesField, I ported the large string handling in
GrowableByteArrayDataOutput.writeString for efficiently allocating a UTF8 byte
array to exactly the right size -- very important for large doc handling.
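
As a rough illustration of the index-time steps in the list above (not the
code in the attached patch), here is a minimal, JDK-only sketch. It assumes a
hypothetical {{Mode}} enum in which only {{STANDARD_UPDATE}} is a name taken
from this message; {{IN_PLACE_UPDATE}}, {{REALTIME_GET}}, the class name and
all method names are made up for the example. It checks whether a value
qualifies for large field handling, then encodes the string into a UTF-8 byte
array allocated at exactly the right size by counting bytes before allocating.

```java
// Illustrative sketch only; class, method and most enum names are hypothetical.
final class LargeFieldSketch {

  // Only STANDARD_UPDATE is named in the message; the other constants are assumptions.
  enum Mode { STANDARD_UPDATE, IN_PLACE_UPDATE, REALTIME_GET }

  // Large handling applies only to standard updates of stored, single-valued
  // fields whose value is a string (not a number).
  static boolean isLargeCandidate(Mode mode, boolean stored, boolean multiValued, Object value) {
    return mode == Mode.STANDARD_UPDATE && stored && !multiValued && value instanceof CharSequence;
  }

  // Counts the UTF-8 bytes needed for s without allocating anything.
  static int utf8Length(CharSequence s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        bytes += 4; // valid surrogate pair -> one 4-byte sequence
        i++;
      } else {
        bytes += 3; // BMP char, or a lone surrogate replaced by U+FFFD
      }
    }
    return bytes;
  }

  // Encodes s into a byte[] sized exactly, so there is no oversized scratch
  // buffer and no copy-to-trim step for very large values.
  static byte[] toExactUtf8(CharSequence s) {
    byte[] out = new byte[utf8Length(s)];
    int p = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        out[p++] = (byte) c;
      } else if (c < 0x800) {
        out[p++] = (byte) (0xC0 | (c >> 6));
        out[p++] = (byte) (0x80 | (c & 0x3F));
      } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        int cp = Character.toCodePoint(c, s.charAt(++i));
        out[p++] = (byte) (0xF0 | (cp >> 18));
        out[p++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[p++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[p++] = (byte) (0x80 | (cp & 0x3F));
      } else if (Character.isSurrogate(c)) {
        // Lone surrogate: emit U+FFFD (EF BF BD), a choice made for this sketch.
        out[p++] = (byte) 0xEF;
        out[p++] = (byte) 0xBF;
        out[p++] = (byte) 0xBD;
      } else {
        out[p++] = (byte) (0xE0 | (c >> 12));
        out[p++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[p++] = (byte) (0x80 | (c & 0x3F));
      }
    }
    return out;
  }
}
```

In a real integration, the resulting bytes would presumably be wrapped in a
Lucene {{BytesRef}} and handed to a {{BinaryDocValuesField}}; that wiring is
left out here so the sketch runs with nothing but the JDK.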
I've got a test failure to track down, so this is definitely not bug-free. New
tests need to be written. And I think this feature would go hand-in-hand with
a compressing BDV-only DocValuesFormat... which should be done in parallel to
this. And then it would be great if people could try this for real; and I
should come up with some synthetic benchmark.
> Large pseudo-stored fields via BinaryDocValuesField
> ---------------------------------------------------
>
> Key: SOLR-10255
> URL: https://issues.apache.org/jira/browse/SOLR-10255
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: David Smiley
> Assignee: David Smiley
> Attachments: SOLR-10255.patch, SOLR-10255.patch
>
>
> (sub-issue of SOLR-10117) This is a proposal for a better way for Solr to
> handle "large" text fields. Large docs that are in Lucene StoredFields slow
> requests that don't involve access to such fields. This follows directly from
> the fact that StoredFields are row-stored. Worse, the Solr documentCache
> will wind up holding onto massive Strings. While the latter could be tackled
> on its own somehow as it's the most serious issue, nevertheless it seems
> wrong that such large fields are in row-stored storage to begin with. After
> all, relational DBs seem to have figured this out and put CLOBs/BLOBs in a
> separate place. Here, we do similarly by using Lucene's
> {{BinaryDocValuesField}}. BDVF isn't well known in the DocValues family as
> it's not for typical DocValues purposes like sorting/faceting etc. The
> default DocValuesFormat doesn't compress these but we could write one that
> does.