[ https://issues.apache.org/jira/browse/SOLR-10806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sachin Goyal updated SOLR-10806:
--------------------------------
    Description: 
Our Solr nodes go down within 20-30 minutes of indexing.
The load rate does not seem to be the cause, because the exception in the logs points to a data problem:

{color:darkred}
INFO  - 2017-06-02 23:21:19.094; org.apache.solr.core.SolrCore; \[node-instances_shard2_replica3\] Registered new searcher Searcher@6740879c\[node-instances_shard2_replica3\] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_ne(6.3.0):C200591/8616:delGen=20) Uninverting(_wx(6.3.0):C72132/697:delGen=5) Uninverting(_y0(6.3.0):c5798/27:delGen=3) Uninverting(_yv(6.3.0):c10935/827:delGen=2) Uninverting(_z4(6.3.0):C4163/2277:delGen=1)))}
ERROR - 2017-06-02 23:21:19.105; org.apache.solr.core.CoreContainer; Error waiting for SolrCore to be created
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core \[node-instances_shard2_replica3\]
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.solr.core.CoreContainer.lambda$load$1(CoreContainer.java:526)
        at org.apache.solr.core.CoreContainer$$Lambda$38/199449817.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$9/1611272577.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Unable to create core \[node-instances_shard2_replica3\]
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:855)
        at org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:498)
        at org.apache.solr.core.CoreContainer$$Lambda$37/1402433372.call(Unknown Source)
        ... 6 more
Caused by: java.lang.NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)
        at org.apache.lucene.util.LegacyNumericUtils.getPrefixCodedLongShift(LegacyNumericUtils.java:163)
        at org.apache.lucene.util.LegacyNumericUtils$1.accept(LegacyNumericUtils.java:392)
        at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:232)
        at org.apache.lucene.index.Terms.getMax(Terms.java:169)
        at org.apache.lucene.util.LegacyNumericUtils.getMaxLong(LegacyNumericUtils.java:504)
        at org.apache.solr.update.VersionInfo.getMaxVersionFromIndex(VersionInfo.java:233)
        at org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1584)
        at org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1610)
        at org.apache.solr.core.SolrCore.seedVersionBuckets(SolrCore.java:949)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:931)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:776)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:842)
        ... 8 more
{color}

It does not seem right that the Solr node itself should go down for such a problem. The log shows this chain of errors:
# Error waiting for SolrCore to be created: java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core
# Unable to create core
# NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)

That is, core creation fails because of some confusion between long and integer values.
If there is a data issue, Solr should report it with an exception during ingestion.
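
A minimal sketch of where the value 64 could come from. It assumes that Lucene 6.x marks prefix-coded long terms with a first byte starting at 0x20 and int terms with a first byte starting at 0x60 (the values of SHIFT_START_LONG and SHIFT_START_INT in LegacyNumericUtils as far as I can tell; please verify against your Lucene version). The class and method names below are hypothetical and only mimic the check, they are not the Lucene code itself.

```java
/**
 * Sketch: how the first byte of a prefix-coded term maps to a shift value.
 * Constants are ASSUMED to match Lucene 6.x LegacyNumericUtils.
 */
public class PrefixShiftDemo {
    // Assumed header offsets for long and int terms.
    public static final int SHIFT_START_LONG = 0x20;
    public static final int SHIFT_START_INT = 0x60;

    /** Decodes the shift of a term that is expected to hold a long. */
    public static int longShift(byte firstByte) {
        int shift = firstByte - SHIFT_START_LONG;
        if (shift > 63 || shift < 0) {
            throw new NumberFormatException("Invalid shift value (" + shift
                + ") in prefixCoded bytes (is encoded value really an INT?)");
        }
        return shift;
    }

    /** First byte of a long term with the given shift (0..63). */
    public static byte longHeader(int shift) {
        return (byte) (SHIFT_START_LONG + shift);
    }

    /** First byte of an int term with the given shift (0..31). */
    public static byte intHeader(int shift) {
        return (byte) (SHIFT_START_INT + shift);
    }

    public static void main(String[] args) {
        // A genuine long term decodes without error.
        System.out.println(longShift(longHeader(0)));
        try {
            // An int term (shift 0) read as a long: 0x60 - 0x20 = 64.
            longShift(intHeader(0));
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, a shift of exactly 64 would be consistent with an int-encoded term turning up in a field that is read as a long. In the trace above that read happens in VersionInfo.getMaxVersionFromIndex.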

\\
\\
*UPDATE*:
Another issue with the above problem is that the Solr cluster is completely inaccessible.
The Solr UI is also not coming up. I restarted the Solr servers and they refuse to recover.
I am not even able to delete the collections and create them afresh.
It seems the only way out is to do an *rm -rf* and re-install.

Note that this is not a network problem: I can ssh to the Solr machines and send messages to other Solr machines using nc.

\\
\\
*UPDATE 2*:
I had a 24-node cluster with 2 collections.
Each collection used 6 nodes in a 2-shard, 3-replica configuration.
So 12 of the 24 nodes were in use.
The remaining 12 nodes had Solr running with the same ZooKeeper but no collections/cores.
After the above errors began to happen, the Solr UI of all 24 nodes became unresponsive!

So I tried the delete-collection API from the command line, but got no response.
Ultimately I ran the delete-collection call from the command line in a loop, and it deleted a part of the collection.
Then I had to manually delete the *<coreName>/data/index/write.lock* file on some nodes to purge those bad collections.
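
For reference, a loop of this kind could look like the following sketch. Only the `action=DELETE` and `name` parameters come from the Solr Collections API; the class name, base URL, timeouts and retry counts are hypothetical, and treating HTTP 200 as success is an assumption.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.function.IntSupplier;

/** Sketch: retry a Collections API DELETE call against an unresponsive node. */
public class DeleteCollectionLoop {
    /** Builds the DELETE URL; baseUrl is e.g. "http://localhost:8983/solr" (hypothetical). */
    public static String deleteUrl(String baseUrl, String collection) {
        try {
            return baseUrl + "/admin/collections?action=DELETE&name="
                + URLEncoder.encode(collection, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Sends one request; returns the HTTP status, or -1 on timeout or I/O failure. */
    public static int callOnce(String url, int timeoutMs) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return -1;
        }
    }

    /** Repeats the attempt until it yields 200; returns the attempts used, or -1 if none succeeded. */
    public static int retry(IntSupplier attempt, int maxAttempts, long pauseMs) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsInt() == 200) {
                return i;
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(pauseMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // args[0] = base URL, args[1] = collection name
        String url = deleteUrl(args[0], args[1]);
        int used = retry(() -> callOnce(url, 5000), 20, 1000);
        System.out.println(used > 0 ? "deleted after " + used + " attempt(s)" : "gave up");
    }
}
```

The explicit connect and read timeouts keep a single hung node from blocking the loop indefinitely.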
It's been a few hours since then. There are no collections, and a few nodes are still unresponsive, with the following messages in the logs:
{color:brown}
INFO  - 2017-06-03 06:40:51.308; org.apache.solr.core.SolrCore; Core sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking again.
INFO  - 2017-06-03 06:40:51.408; org.apache.solr.core.SolrCore; Core sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking again.
INFO  - 2017-06-03 06:40:51.508; org.apache.solr.core.SolrCore; Core sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking again.
INFO  - 2017-06-03 06:40:51.608; org.apache.solr.core.SolrCore; Core sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking again.
{color}

It looks like a serious stability problem to me.


> Solr Replica goes down with NumberFormatException: Invalid shift value (64) 
> in prefixCoded bytes (is encoded value really an INT?)
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10806
>                 URL: https://issues.apache.org/jira/browse/SOLR-10806
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 6.3.1
>            Reporter: Sachin Goyal
>


