[
https://issues.apache.org/jira/browse/SOLR-16524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630031#comment-17630031
]
Joel Bernstein edited comment on SOLR-16524 at 11/7/22 9:15 PM:
----------------------------------------------------------------
Agreed, and the initial plan for this ticket was not to do that. The initial
plan was simply to add the partitions at indexing time, which is pretty
straightforward.
But I will say that I'm using this technique in a module that I haven't
contributed yet, and the performance is quite surprising. Reading raw bytes in
long sequential reads when you manage the byte buffers yourself (not using
BufferedInputStream) is so fast it's almost free. I got a lesson in how fast a
computer really is when I first saw it run. It's all the layers of abstraction
that are put on top of the bytes that add up. So, if you simply scan a byte
array and create hashes as you read, it will be surprisingly fast. But this is
probably not the issue to introduce this on.
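A minimal sketch of the kind of scan described above, not the uncontributed
module itself (which is not shown in this thread). It assumes newline-delimited
records and uses 32-bit FNV-1a as a stand-in hash; the class name RawScanHash,
the method countByPartition, and the buffer sizes are all hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class RawScanHash {
    // 32-bit FNV-1a constants (a stand-in hash chosen for this sketch).
    private static final int FNV_OFFSET = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    /** Counts newline-delimited records per partition, hashing bytes as they are scanned. */
    static long[] countByPartition(Path file, int numPartitions, int bufferSize) throws IOException {
        long[] counts = new long[numPartitions];
        byte[] buf = new byte[bufferSize];          // the buffer we manage ourselves
        ByteBuffer wrapper = ByteBuffer.wrap(buf);  // view over buf for the channel to fill
        int hash = FNV_OFFSET;
        boolean inRecord = false;                   // true once the current record has any bytes

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (true) {
                wrapper.clear();
                int n = ch.read(wrapper);           // one long sequential read per iteration
                if (n < 0) {
                    break;                          // end of file
                }
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {
                        if (inRecord) {
                            counts[Math.floorMod(hash, numPartitions)]++;
                        }
                        hash = FNV_OFFSET;          // reset for the next record
                        inRecord = false;
                    } else {
                        hash = (hash ^ (b & 0xff)) * FNV_PRIME;
                        inRecord = true;
                    }
                }
            }
        }
        if (inRecord) {                             // last record without a trailing newline
            counts[Math.floorMod(hash, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("rawscan", ".txt");
        try {
            Files.write(tmp, "alpha\nbeta\ngamma\nalpha\n".getBytes(StandardCharsets.UTF_8));
            System.out.println(Arrays.toString(countByPartition(tmp, 4, 1 << 16)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The hash state is carried across buffer refills, so a record that straddles
two reads still hashes the same way and no per-record objects are allocated.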
> Index time hash partitioning
> ----------------------------
>
> Key: SOLR-16524
> URL: https://issues.apache.org/jira/browse/SOLR-16524
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Joel Bernstein
> Priority: Major
>
> Both Streaming Expressions and Spark-Solr currently rely on query time hash
> partitioning using the HashQParserPlugin. The query time hash partitioning,
> although extremely flexible, is very slow when it builds its initial filters.
> This ticket will add an indexing time hash partitioner that Streaming
> Expressions and Spark-Solr will both be able to use.
> When this ticket is complete I'll also update the ParallelStream and
> Spark-Solr to be able to use the index time partitioning rather than the
> HashQParserPlugin.
> This is a stepping stone towards much more performant parallel distributed
> joins.
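To illustrate the idea in the issue description, here is a rough sketch of
index-time hash partitioning in plain Java. It is not Solr code and not the
design this ticket will ship: the class IndexTimePartition, both method names,
the FNV-1a hash, and the round-robin assignment of partitions to workers are
assumptions made only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class IndexTimePartition {

    /** Computed once per document at indexing time and stored alongside it (hypothetical). */
    static int partitionOf(String key, int numPartitions) {
        int h = 0x811C9DC5;                        // 32-bit FNV-1a offset basis
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = (h ^ (b & 0xff)) * 0x01000193;     // 32-bit FNV-1a prime
        }
        return Math.floorMod(h, numPartitions);    // always in [0, numPartitions)
    }

    /** Partitions a worker would read at query time, assigned round-robin (hypothetical). */
    static List<Integer> partitionsForWorker(int worker, int numWorkers, int numPartitions) {
        List<Integer> owned = new ArrayList<>();
        for (int p = worker; p < numPartitions; p += numWorkers) {
            owned.add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        int numWorkers = 3;
        // Indexing side: each document gets a partition number derived from its key.
        for (String key : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(key + " -> partition " + partitionOf(key, numPartitions));
        }
        // Query side: each worker filters on its stored partitions instead of hashing every key.
        for (int w = 0; w < numWorkers; w++) {
            System.out.println("worker " + w + " reads " + partitionsForWorker(w, numWorkers, numPartitions));
        }
    }
}
```

The point of the sketch is only that the hashing cost moves to indexing time,
so a worker's filter at query time becomes a lookup on a precomputed value.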