[jira] [Comment Edited] (SOLR-16524) Index time hash partitioning

Joel Bernstein (Jira) Mon, 07 Nov 2022 13:16:06 -0800


    [ https://issues.apache.org/jira/browse/SOLR-16524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630031#comment-17630031 ]


Joel Bernstein edited comment on SOLR-16524 at 11/7/22 9:15 PM:
----------------------------------------------------------------

Agreed, and the initial plan for this ticket was not to do that. The initial 
plan was simply to add the partitions at indexing time, which is pretty 
straightforward.

But I will say that I'm using this technique in a module that I haven't 
contributed yet, and the performance is quite surprising. Reading raw bytes in 
long sequential reads when you manage the byte buffers yourself (not using 
BufferedInputStream) is so fast it's almost free. I got a lesson on how fast a 
computer really is when I first saw it run. It's all the layers of abstraction 
that are put on top of the bytes that add up. So, if you simply scan a byte 
array and create hashes as you read, it will be surprisingly fast. But this is 
probably not the issue to introduce this on.
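
To make the idea concrete, here is a minimal sketch of scanning raw bytes from a self-managed buffer and hashing records as they go by. It is not the unreleased module mentioned above. The class name RawScanHasher, the newline-delimited record format, the 1 MiB buffer size, and the use of 32-bit FNV-1a as the hash are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: counts newline-delimited records per hash partition
// while scanning raw bytes. FNV-1a is an illustrative choice of hash.
final class RawScanHasher {
    private static final int FNV_OFFSET = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    private final long[] counts;
    private int hash = FNV_OFFSET;     // running hash of the current record
    private boolean inRecord = false;  // true once the current record has any bytes

    RawScanHasher(int numPartitions) {
        this.counts = new long[numPartitions];
    }

    // Scans buf[0..len). Hash state is carried across calls, so a record may
    // span two consecutive buffers.
    void scan(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            byte b = buf[i];
            if (b == '\n') {
                emit();
            } else {
                hash = (hash ^ (b & 0xFF)) * FNV_PRIME;
                inRecord = true;
            }
        }
    }

    // Assigns the finished record to a partition and resets the hash state.
    // Empty records (blank lines) are skipped.
    private void emit() {
        if (inRecord) {
            counts[Math.floorMod(hash, counts.length)]++;
        }
        hash = FNV_OFFSET;
        inRecord = false;
    }

    // Flushes a trailing record that has no final newline and returns the counts.
    long[] finish() {
        emit();
        return counts;
    }

    // Reads the file in long sequential reads into one reusable heap buffer,
    // with no BufferedInputStream layer in between.
    static long[] partitionCounts(Path file, int numPartitions) throws IOException {
        RawScanHasher hasher = new RawScanHasher(numPartitions);
        ByteBuffer buffer = ByteBuffer.allocate(1 << 20);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();
                hasher.scan(buffer.array(), buffer.limit());
                buffer.clear();
            }
        }
        return hasher.finish();
    }
}
```

The inner loop touches each byte once and allocates nothing per record, which is the property the comment above is pointing at.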







> Index time hash partitioning
> ----------------------------
>
>                 Key: SOLR-16524
>                 URL: https://issues.apache.org/jira/browse/SOLR-16524
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>            Priority: Major
>
> Both Streaming Expressions and Spark-Solr currently rely on query time hash 
> partitioning using the HashQParserPlugin. The query time hash partitioning, 
> although extremely flexible, is very slow when it builds its initial filters. 
> This ticket will add an indexing time hash partitioner that Streaming 
> Expressions and Spark-Solr will both be able to use.
> When this ticket is complete I'll also update the ParallelStream and 
> Spark-Solr to be able to use the index time partitioning rather than the 
> HashQParserPlugin.
> This is a stepping stone towards much more performant parallel distributed 
> joins.
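
As a rough illustration of the index-time approach described in the ticket, the sketch below computes a partition number from a key when a document is prepared for indexing and stores it alongside the document, so that a worker could later select its share with a plain filter on that field. This is not the ticket's implementation: the class name IndexTimePartitioner, the field name hash_partition_i, the map-based document, and the FNV-1a hash are hypothetical choices, and no Solr API is used.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of index-time hash partitioning on a plain map document.
final class IndexTimePartitioner {
    // Assumed field name, not one defined by Solr or by the ticket.
    static final String PARTITION_FIELD = "hash_partition_i";

    // 32-bit FNV-1a over the UTF-8 bytes of the key, reduced to [0, numPartitions).
    static int partitionOf(String key, int numPartitions) {
        int h = 0x811C9DC5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = (h ^ (b & 0xFF)) * 0x01000193;
        }
        return Math.floorMod(h, numPartitions);
    }

    // Returns a copy of the document with the partition number added, computed
    // once at indexing time from the value of keyField.
    static Map<String, Object> withPartition(Map<String, Object> doc,
                                             String keyField,
                                             int numPartitions) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        String key = String.valueOf(doc.get(keyField));
        out.put(PARTITION_FIELD, partitionOf(key, numPartitions));
        return out;
    }
}
```

Because the partition is fixed when the document is indexed, the number of partitions has to be chosen up front, which is the flexibility the query-time approach keeps and this one trades away for speed.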



