[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-03-09 Thread David Smiley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903287#comment-15903287 ]

David Smiley commented on SOLR-10117:
-------------------------------------

bq. Would this deduplicate large fields replicated in multiple records?

No.  If I were tasked to do that, I might implement a customized 
DocValuesFormat that deduplicates per segment (it could not dedup at higher 
tiers) by using the value's length as a crude hash and then verifying a match 
by re-reading the original.  There would be no query-time overhead; duplicate 
values would simply share the internal offset/length pointer.
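
Roughly, as a toy sketch of the write-side dedup (not a real DocValuesFormat 
-- the segment's data file is simulated with an in-memory buffer, and all the 
names here are made up):

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

  /** Offset + length of a value already written to this segment's data. */
  static final class Pointer {
    final long offset;
    final int length;
    Pointer(long offset, int length) { this.offset = offset; this.length = length; }
  }

  private final ByteArrayOutputStream dataFile = new ByteArrayOutputStream(); // stand-in
  private final Map<Integer, List<Pointer>> byLength = new HashMap<>();

  /** Returns a pointer to the value's bytes, shared with any prior duplicate. */
  Pointer addValue(byte[] value) {
    for (Pointer p : byLength.getOrDefault(value.length, new ArrayList<Pointer>())) {
      // Length matched (the crude hash); verify by re-reading the original.
      byte[] prior = Arrays.copyOfRange(
          dataFile.toByteArray(), (int) p.offset, (int) p.offset + p.length);
      if (Arrays.equals(prior, value)) {
        return p; // dedup: share the existing offset/length pointer
      }
    }
    Pointer fresh = new Pointer(dataFile.size(), value.length);
    dataFile.write(value, 0, value.length);
    byLength.computeIfAbsent(value.length, k -> new ArrayList<>()).add(fresh);
    return fresh;
  }
}
{code}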

> Big docs and the DocumentCache; umbrella issue
> ----------------------------------------------
>
> Key: SOLR-10117
> URL: https://issues.apache.org/jira/browse/SOLR-10117
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public (Default Security Level. Issues are Public)
> Reporter: David Smiley
> Assignee: David Smiley
> Attachments: SOLR_10117_large_fields.patch
>
>
> This is an umbrella issue for improved handling of large documents (large 
> stored fields), generally related to the DocumentCache or SolrIndexSearcher's 
> doc() methods.  Highlighting is affected as it's the primary consumer of this 
> data.  "Large" here means multi-megabyte, especially tens or even hundreds of 
> megabytes.  We'd like to support such users without forcing them to choose 
> between no DocumentCache (bad performance) or having one but hitting OOM due 
> to massive Strings winding up in there.  I've contemplated this for longer 
> than I'd like to admit, and it's a complicated issue with differing concerns 
> to balance.






[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-03-09 Thread Alexandre Rafalovitch (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903226#comment-15903226 ]

Alexandre Rafalovitch commented on SOLR-10117:
-------------------------------------

No help with code, I'm afraid, but a question:

Would this deduplicate large fields that are replicated across multiple 
records? We advise people to denormalize records, so if big stored fields were 
actually stored only once (e.g., a repeated description pushed into child 
records), that would be a marketing argument, not just a technical one.







[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-03-08 Thread David Smiley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902542#comment-15902542 ]

David Smiley commented on SOLR-10117:
-------------------------------------

Spinning off SOLR-10255 for a BinaryDocValues-based approach.  I could have 
used a JIRA sub-task, but I'm not a fan of those when the issue space is still 
a bit exploratory.







[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-02-21 Thread David Smiley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876640#comment-15876640 ]

David Smiley commented on SOLR-10117:
-------------------------------------

Another technique that I think makes a lot of sense is to cap the stored value 
at a configurable length -- beyond which there can be no highlighting, of 
course.  This can be achieved even without an explicit Solr feature by using a 
copyField with {{maxChars}} set, although that may hinder 
{{hl.requireFieldMatch=true}} if one goes that route.
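
For example, with a schema snippet along these lines (the field names and the 
100K cap are just illustrative):

{code:xml}
<!-- The full body is indexed for search but not stored. -->
<field name="body" type="text_general" indexed="true" stored="false"/>
<!-- Only the first 100,000 chars are copied, so the stored value is capped. -->
<field name="body_capped" type="text_general" indexed="true" stored="true"/>
<copyField source="body" dest="body_capped" maxChars="100000"/>
{code}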







[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-02-14 Thread David Smiley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867396#comment-15867396 ]

David Smiley commented on SOLR-10117:
-------------------------------------

Another idea that I like more the more I think about it is to put large fields 
into BinaryDocValues, with compression (either at the DocValuesFormat (codec) 
layer or at the Solr layer).  For very large fields, I think column-stored 
(hence docValues) actually makes more sense than the stored-field codec 
(row-stored).  Then at the Solr layer we add docValues support to TextField (as 
BinaryDocValues), and also enable SolrIndexSearcher.doc() to see 
{{useDocValuesAsStored}} fields, thus letting highlighting see them.  I wish I 
had thought of this earlier.
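
At indexing time, the Solr-layer compression option might look roughly like 
this (a sketch only; the field name is made up and Deflater is an arbitrary 
choice of codec):

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

public class LargeFieldIndexer {

  /** Compress a large text value and add it to the doc as binary doc values. */
  static void addLargeField(Document doc, String fieldName, String value) {
    byte[] input = value.getBytes(StandardCharsets.UTF_8);
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(input.length / 4 + 64);
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    // Column-stored: one value per doc, readable without decompressing the
    // document's whole stored-fields block.
    doc.add(new BinaryDocValuesField(fieldName, new BytesRef(out.toByteArray())));
  }
}
{code}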







[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-02-09 Thread David Smiley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860835#comment-15860835 ]

David Smiley commented on SOLR-10117:
-------------------------------------

A couple of tangential issues worth addressing:
* {{QueryComponent.doPreFetch}} logic should be configurable, or perhaps we 
should simply never prefetch and instead make the highlighting component smart 
enough to ensure the applicable docs get into the cache with fl + hl.fl + id.
* {{UnifiedSolrHighlighter}} loads only the IDs, in a way that will likely have 
bad cache performance.
* {{SolrIndexSearcher#doc(docId,StoredFieldVisitor)}} doesn't populate the 
DocumentCache; it only reads from it if present.  It's more work, but it'd be 
nice if it also populated the cache, with not only the "needsField(field)" == 
true results but perhaps also the fields detected from "fl" (see the sketch 
after this list).
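
For reference, the visitor-based loading in question is roughly this; the 
hypothetical part is offering the result to the DocumentCache rather than 
treating it as read-only:

{code:java}
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.DocumentStoredFieldVisitor;
import org.apache.lucene.index.IndexReader;

public class VisitorLoading {

  /** Load only the requested stored fields of a doc via the visitor API. */
  static Document loadFields(IndexReader reader, int docId, Set<String> fields)
      throws IOException {
    DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor(fields);
    reader.document(docId, visitor); // keeps only needsField(...) == true fields
    // Hypothetically, this partial Document could also be offered to the
    // DocumentCache here instead of being discarded after use.
    return visitor.getDocument();
  }
}
{code}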







[jira] [Commented] (SOLR-10117) Big docs and the DocumentCache; umbrella issue

2017-02-09 Thread David Smiley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860833#comment-15860833 ]

David Smiley commented on SOLR-10117:
-------------------------------------

Some ideas I have rejected or have serious doubts about:
* Make Document implement {{Accountable}} and then use LRUCache's configurable 
RAM size.  The problem is lazy field loading: a Document would essentially grow 
in size after it had already been measured by the cache.  And lazy field 
loading is important, especially with large documents.
* A special VeryLazyField, perhaps subclassing LazyDocument.LazyField, that 
_always_ goes to disk instead of keeping a reference.  I started down this path 
but stopped when I realized that multi-valued fields would be difficult to 
handle, likely resulting in terrible performance as each value would need to 
seek to and decompress the document on its own.
* Black-list certain fields from being placed onto the Document, so they never 
make it into the cache.  A highlighter (or anything else) that used the 
StoredFieldVisitor API would still be able to reach them, though.  I'm worried 
such a feature would cause unintended breakage elsewhere; perhaps atomic 
updates or who-knows-what.

One key thing to understand is how LazyDocument works and the semantics of lazy 
field loading.  Essentially, the moment you refer to any field that wasn't 
loaded eagerly (i.e., requested explicitly the first time around), _all_ fields 
of the document are loaded.  If some fields are big, this is bad.
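
A toy illustration of those semantics (field names assumed; LazyDocument lives 
in lucene-misc):

{code:java}
import org.apache.lucene.document.LazyDocument;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexableField;

public class LazySemanticsDemo {

  // Assume "title" is tiny and "body" is huge; both are stored fields.
  static void demo(IndexReader reader, int docId,
                   FieldInfo titleInfo, FieldInfo bodyInfo) {
    LazyDocument lazyDoc = new LazyDocument(reader, docId);
    IndexableField title = lazyDoc.getField(titleInfo); // just a stub, no I/O
    IndexableField body = lazyDoc.getField(bodyInfo);   // just a stub, no I/O
    // The first real access loads ALL of this LazyDocument's fields at once;
    // the multi-megabyte body comes in even though we only wanted the title.
    System.out.println(title.stringValue());
  }
}
{code}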

The kernel of an idea I feel is most promising is to _either_ have lazy field 
loading apply to a configurable set of fields only (all others always eager), 
_or_ leave lazy field loading as-is but add configuration that designates some 
fields as "very large" (potentially so, anyway).  In the latter case, these 
fields would be backed by a _separate_ LazyDocument instance, so their loading 
would not be triggered by lazy loading of the other fields -- see the sketch 
below.  In the former case (restricting which fields are lazy), the intention 
is that if you have very big fields, you'd configure lazy field loading for 
just those.
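
A rough sketch of the "very large fields in their own LazyDocument" option 
(hypothetical names; this is not the attached patch):

{code:java}
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LazyDocument;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;

public class SplitLazyLoader {

  /** Build a doc whose "very large" fields live in their own LazyDocument. */
  static Document wrap(IndexReader reader, int docId,
                       Iterable<FieldInfo> fieldInfos, Set<String> largeFields) {
    LazyDocument normal = new LazyDocument(reader, docId); // shared by small fields
    LazyDocument large = new LazyDocument(reader, docId);  // isolates the big ones
    Document doc = new Document();
    for (FieldInfo info : fieldInfos) {
      LazyDocument owner = largeFields.contains(info.name) ? large : normal;
      doc.add(owner.getField(info));
    }
    // Touching a small field triggers loading in 'normal' only; the large
    // fields stay on disk until one of them is explicitly accessed.
    return doc;
  }
}
{code}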

With that kernel of an idea in place, the next piece is revisiting the cache 
semantics of the {{SolrIndexSearcher#doc(docId,StoredFieldVisitor)}} method, 
which is currently used only by the UnifiedHighlighter, PostingsHighlighter 
(its ancestor), and, oddly, distributed grouping.  The latter ought to be 
adjusted to not use it, by the way -- very simple.  This method currently 
detects a document cache entry and uses it, indirectly triggering lazy field 
loading as a consequence.  But for very big fields we don't want that to 
happen.  So perhaps we could change the cache semantics a bit: if a very large 
field is requested, skip the cached doc and go straight to disk, loading the 
fields that way.
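
In pseudo-Java, the changed semantics might look like this (everything here -- 
the cache map, largeFields, loadFromDisk -- is a hypothetical stand-in, not 
Solr's actual API):

{code:java}
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.document.Document;

public class CacheBypassSketch {

  private final Map<Integer, Document> documentCache = new HashMap<>(); // stand-in
  private final Set<String> largeFields; // fields configured as "very large"

  CacheBypassSketch(Set<String> largeFields) { this.largeFields = largeFields; }

  Document doc(int docId, Set<String> neededFields) throws IOException {
    // If any requested field is designated "very large", bypass the cache so we
    // neither trigger whole-doc lazy loading nor cache a massive String.
    if (!Collections.disjoint(neededFields, largeFields)) {
      return loadFromDisk(docId, neededFields); // straight to the stored fields
    }
    Document cached = documentCache.get(docId);
    return cached != null ? cached : loadFromDisk(docId, neededFields);
  }

  private Document loadFromDisk(int docId, Set<String> fields) throws IOException {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
{code}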

I'm very interested in any feedback on my thoughts on this.  There are some 
tangential issues as well that could very well be sub-tasks here.



