[jira] [Updated] (SOLR-9050) IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again

2016-04-29 Thread Timothy Potter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-9050:
-
Attachment: SOLR-9050.patch

Updated patch (against 5.3.1 branch since that's the one I'm having issues with 
in prod). I've replicated the SocketTimeoutException locally but the 
IndexFetcher retries as expected locally, but I'm not seeing that in my prod 
server logs?

This patch uses the distribUpdate Timeouts from UpdateShardHandler (config in 
solr.xml) and adds some better logging so we can get a better picture of what's 
happening with these re-downloads after a STE.

> IndexFetcher not retrying after SocketTimeoutException correctly, which leads 
> to trying a full download again
> -
>
> Key: SOLR-9050
> URL: https://issues.apache.org/jira/browse/SOLR-9050
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 5.3.1
>Reporter: Timothy Potter
>Assignee: Timothy Potter
> Attachments: SOLR-9050.patch, SOLR-9050.patch
>
>
> I'm seeing a problem where reading a large file from the leader (in SolrCloud 
> mode) during index replication leads to a SocketTimeoutException:
> {code}
> 2016-04-28 16:22:23.568 WARN  (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error 
> in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:150)
> at java.net.SocketInputStream.read(SocketInputStream.java:121)
> at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
> at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
> at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
> at 
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
> at 
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> at 
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
> at 
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at 
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
> at 
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> at 
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> {code}
> and this leads to the following error in cleanup:
> {code}
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler 
> Index fetch failed :org.apache.solr.common.SolrException: Unable to download 
> _405k.cfs completely. Downloaded 7314866176!=9990844536
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
> at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
> at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy 
> Error while trying to recover:org.apache.solr.common.SolrException: 
> Replication for recovery failed.
> at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
> at 
> 

[jira] [Updated] (SOLR-9050) IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again

2016-04-29 Thread Timothy Potter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-9050:
-
Attachment: SOLR-9050.patch

Start of a patch that does not yet solve anything. It logs the error in the 
fetchPackets method in case the exception gets squelched by the exception 
raised in the cleanup method and tries to use the timeouts from solr.xml. I 
need to come up with a way to reproduce this locally w/o huge indexes to try to 
see what's really going on.

> IndexFetcher not retrying after SocketTimeoutException correctly, which leads 
> to trying a full download again
> -
>
> Key: SOLR-9050
> URL: https://issues.apache.org/jira/browse/SOLR-9050
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 5.3.1
>Reporter: Timothy Potter
>Assignee: Timothy Potter
> Attachments: SOLR-9050.patch
>
>
> I'm seeing a problem where reading a large file from the leader (in SolrCloud 
> mode) during index replication leads to a SocketTimeoutException:
> {code}
> 2016-04-28 16:22:23.568 WARN  (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error 
> in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:150)
> at java.net.SocketInputStream.read(SocketInputStream.java:121)
> at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
> at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
> at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
> at 
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
> at 
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> at 
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
> at 
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at 
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
> at 
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> at 
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> {code}
> and this leads to the following error in cleanup:
> {code}
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler 
> Index fetch failed :org.apache.solr.common.SolrException: Unable to download 
> _405k.cfs completely. Downloaded 7314866176!=9990844536
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
> at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
> at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy 
> Error while trying to recover:org.apache.solr.common.SolrException: 
> Replication for recovery failed.
> at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
> at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at 
>