[
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man reopened SOLR-13399:
-----------------------------
Since the new SplitByPrefixTest was committed as part of this jira, it has
failed a little over 5% of the time it's been run by Jenkins -- on both master
and branch_8x.
All of these failures occur at the same {{assertTrue(slice1 != slice2)}} call
(SplitByPrefixTest.java:222) and all of the seeds I've tested appear to
reproduce reliably...
On master...
{noformat}
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10
-Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest
-Dtests.seed=4A09C6784BF1B28F -Dtests.multiplier=2 -Dtests.slow=true
-Dtests.locale=ar-YE -Dtests.timezone=MET -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
...
[junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10
-Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest
-Dtests.seed=75D9C45CAC5D0D22 -Dtests.slow=true -Dtests.locale=yo-BJ
-Dtests.timezone=Africa/Porto-Novo -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
...
[junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
{noformat}
On branch_8x...
{noformat}
hossman@tray:~/lucene/dev/solr/core [j8] [branch_8x] $ ant test -Dtests.dups=10
-Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest
-Dtests.seed=B980178A30F46BB3 -Dtests.multiplier=2 -Dtests.slow=true
-Dtests.locale=ko-KR -Dtests.timezone=Africa/Abidjan -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
...
[junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
{noformat}
> compositeId support for shard splitting
> ---------------------------------------
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
> Issue Type: New Feature
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into
> account the actual distribution (number of documents) in each hash bucket
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD*
> command that would look at the number of docs sharing each compositeId prefix
> and use that to create roughly equal sized buckets by document count rather
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash
> buckets unless necessary (since that leads to larger query fanout). Perhaps
> this warrants a parameter that would control how much of a size mismatch is
> tolerable before resorting to splitting within a bucket:
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index
> the prefix in a different field. Iterating over the terms for this field
> would quickly give us the number of docs in each (i.e., Lucene already keeps
> track of the doc count for each term). Perhaps the implementation could be a
> flag on the *id* field... something like *indexPrefixes* and poly-fields that
> would cause the indexing to be automatically done and alleviate having to
> pass in an additional field during indexing and during the call to
> *SPLITSHARD*. This whole part is an optimization though and could be split
> off into its own issue if desired.
>
>
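For readers following along, the idea in the description above (count docs per compositeId prefix, then split on a bucket boundary so both halves hold roughly the same number of docs) can be sketched in plain Java. This is only an illustration and not the code from the attached patches: the class and method names are hypothetical, and it assumes the per-prefix buckets are already listed in hash-range order and that an id has the usual "prefix!rest" compositeId shape.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the SOLR-13399 implementation.
public class PrefixSplitSketch {

    // Count documents per compositeId prefix (the text before the first '!').
    // Ids without a '!' are grouped under the empty prefix.
    static Map<String, Integer> countByPrefix(List<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            int bang = id.indexOf('!');
            String prefix = bang < 0 ? "" : id.substring(0, bang);
            counts.merge(prefix, 1, Integer::sum);
        }
        return counts;
    }

    // Given doc counts for each prefix bucket (assumed to be in hash-range
    // order), return how many leading buckets go to the first sub-shard so
    // that the two sides are as close in doc count as possible. The split
    // always falls on a bucket boundary, never inside a bucket.
    static int chooseSplitIndex(List<Integer> bucketCounts) {
        if (bucketCounts.size() < 2) {
            throw new IllegalArgumentException("need at least two buckets to split on a boundary");
        }
        long total = 0;
        for (int c : bucketCounts) {
            total += c;
        }
        int bestIndex = 1;
        long bestDiff = Long.MAX_VALUE;
        long left = 0;
        for (int i = 1; i < bucketCounts.size(); i++) {
            left += bucketCounts.get(i - 1);
            // |left - right| where right = total - left
            long diff = Math.abs(total - 2 * left);
            if (diff < bestDiff) {
                bestDiff = diff;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

With bucket counts of 10, 1, 1 and 8, the sketch puts only the first bucket in the first sub-shard (10 docs on each side), whereas splitting the bucket list down the middle would give 11 and 9. A tolerance such as the proposed *allowedSizeDifference* could then decide whether the best boundary split is acceptable or whether a bucket itself has to be split.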