[ https://issues.apache.org/jira/browse/SOLR-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated SOLR-10255:
--------------------------------
    Attachment: SOLR-10255.patch

I'm sure you do, Michael :-)

I temporarily hacked the default "large"-ness to be true (provided the field is 
otherwise compatible with large -- stored & not multi-valued) and ran the Solr 
core tests so that I could tease out unforeseen issues.  Thankfully there were 
only two failures:
* some Luke test or another failed -- to be expected, since it looks at the 
underlying index.
* RealtimeGet wanted to reconstitute the document and get a real stored field 
but didn't get it, and it skipped over the BinaryDocValues field.  This'll 
require some adjustment to the approach...

Here's a substantially improved patch.
* Moved the index-time large field handling from TextField.createField (now 
unmodified) to Solr's {{DocumentBuilder}}.
** Here I added an enum {{Mode}} to clarify the circumstances in which the 
DocumentBuilder is used, refactoring its API a little and adjusting the call 
sites.  I also eliminated a couple of convenience accessors like 
{{AddUpdateCommand.getLuceneDocument}} that hid this important Mode intention.  
I think this is a definite improvement: it's now clear in which circumstances 
an in-place update happens vs. a realtime get etc., and it's clear we're 
actually building this Document rather than "get"-ing one.
** Only in {{Mode.STANDARD_UPDATE}} does the large field handling occur.
** StrField ("string" field) works too; there's no explicit Solr FieldType 
assumption, just that the field be stored, not multi-valued, and produce a 
string rather than a number.
* To further streamline processing at retrieval time, the {{DocumentBuilder}} 
large field processing will insert a special marker stored field value that is 
recognized by the {{SolrIndexSearcher.doc(...)}} methods.  (The first sketch 
after this list illustrates the eligibility test and the marker.)
** This allows us to easily put large fields separately only _conditionally_ -- 
e.g. only the values that are actually long -- and maybe support multi-valued 
fields by handling the first occurrence.  This is all a TODO.
* Refactored the {{SolrIndexSearcher.doc(id,fields)}} loading to always use a 
new {{SolrDocumentStoredFieldVisitor}} inner class, which handles lazy field 
loading, large field handling, or neither (see the second sketch after this 
list).  This seems simpler than trying to handle an increasingly large number 
of combinations of cases (doc cache, lazy fields, non-null fields filter, 
large fields) as separate paths.  Unfortunately the doc() method taking a 
visitor is an additional path that can't easily be combined.
* When creating the BinaryDocValuesField, I ported the large-string handling 
from GrowableByteArrayDataOutput.writeString, which efficiently allocates a 
UTF-8 byte array of exactly the right size -- very important for large values 
(see the third sketch after this list).
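
To make the items above concrete, here is a minimal standalone sketch of the 
index-side idea -- the eligibility test plus diverting the value into a 
BinaryDocValuesField behind a small marker stored value.  The threshold, 
marker constant, and method names are made up for illustration; this is not 
the patch's actual code.

{code:java}
// Illustrative sketch only; names and the marker scheme are hypothetical.
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.util.BytesRef;

import java.nio.charset.StandardCharsets;

class LargeFieldSketch {
  // Hypothetical marker; the real patch inserts its own special stored value.
  static final String LARGE_FIELD_MARKER = "##large-field-marker##";
  static final int LARGE_THRESHOLD_CHARS = 512 * 1024; // assumed cutoff

  /** Should this field value be diverted to a BinaryDocValuesField? */
  static boolean isLargeEligible(boolean stored, boolean multiValued, Object value) {
    return stored
        && !multiValued
        && value instanceof String // must produce a string, not a number
        && ((String) value).length() > LARGE_THRESHOLD_CHARS;
  }

  /** Index a tiny marker stored value plus the real value as BinaryDocValues. */
  static void addLargeField(Document doc, String fieldName, String value) {
    doc.add(new StoredField(fieldName, LARGE_FIELD_MARKER));
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8); // see the sizing sketch below
    doc.add(new BinaryDocValuesField(fieldName, new BytesRef(utf8)));
  }
}
{code}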
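
Here is the retrieval-side shape as a sketch: a single StoredFieldVisitor 
subclass whose needsField() decision folds together the fields filter and the 
large-field handling.  The class and constructor are hypothetical; the real 
{{SolrDocumentStoredFieldVisitor}} additionally covers lazy field loading and 
documentCache interaction.

{code:java}
// Illustrative only -- NOT the patch's SolrDocumentStoredFieldVisitor.
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.StoredFieldVisitor;

import java.io.IOException;
import java.util.Set;

class FieldFilteringVisitorSketch extends StoredFieldVisitor {
  private final Set<String> fieldsToLoad; // null means "load everything"
  private final Set<String> largeFields;  // real value lives in BinaryDocValues

  FieldFilteringVisitorSketch(Set<String> fieldsToLoad, Set<String> largeFields) {
    this.fieldsToLoad = fieldsToLoad;
    this.largeFields = largeFields;
  }

  @Override
  public Status needsField(FieldInfo fieldInfo) throws IOException {
    if (fieldsToLoad != null && !fieldsToLoad.contains(fieldInfo.name)) {
      return Status.NO; // not requested (the real code may defer to a lazy field)
    }
    if (largeFields.contains(fieldInfo.name)) {
      // The stored value is just the marker; the real value would be fetched
      // from BinaryDocValues on demand rather than materialized here.
      return Status.NO;
    }
    return Status.YES; // load this stored field normally
  }
}
{code}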
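
And the exact-size UTF-8 allocation, assuming Lucene's UnicodeUtil helpers 
(the same ones GrowableByteArrayDataOutput.writeString uses on its 
large-string path); the surrounding class is just scaffolding:

{code:java}
// Size the UTF-8 buffer exactly instead of the usual chars*3 worst case.
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.UnicodeUtil;

class ExactUtf8Sketch {
  static BytesRef toExactUtf8(String s) {
    int utf8Len = UnicodeUtil.calcUTF16toUTF8Length(s, 0, s.length()); // exact byte count
    byte[] bytes = new byte[utf8Len];                                  // no 3x slack
    UnicodeUtil.UTF16toUTF8(s, 0, s.length(), bytes, 0);               // encode into it
    return new BytesRef(bytes, 0, utf8Len);
  }
}
{code}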

I've got a test failure to track down, so this is definitely not bug-free.  New 
tests need to be written.  And I think this feature would go hand-in-hand with 
a compressing BDV-only DocValuesFormat... which should be done in parallel with 
this.  And then it would be great if people could try this for real; I should 
also come up with a synthetic benchmark.

> Large psuedo-stored fields via BinaryDocValuesField
> ---------------------------------------------------
>
>                 Key: SOLR-10255
>                 URL: https://issues.apache.org/jira/browse/SOLR-10255
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: SOLR-10255.patch, SOLR-10255.patch
>
>
> (sub-issue of SOLR-10117)  This is a proposal for a better way for Solr to 
> handle "large" text fields.  Large docs that are in Lucene StoredFields slow 
> down requests that don't involve access to such fields.  This is fundamental 
> to the fact that StoredFields are row-stored.  Worse, the Solr documentCache 
> will wind up holding onto massive Strings.  While the latter could be tackled 
> on its own somehow, as it's the most serious issue, it nevertheless seems 
> wrong that such large fields sit in row-stored storage to begin with.  After 
> all, relational DBs seem to have figured this out and put CLOBs/BLOBs in a 
> separate place.  Here, we do similarly by using Lucene's 
> {{BinaryDocValuesField}}.  BDVF isn't well known in the DocValues family as 
> it's not for typical DocValues purposes like sorting/faceting etc.  The 
> default DocValuesFormat doesn't compress these, but we could write one that 
> does.


