[jira] [Commented] (SOLR-9050) IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again

2016-04-29 Thread Timothy Potter (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264572#comment-15264572 ]

Timothy Potter commented on SOLR-9050:
--

hmmm ... I reproduced the STE locally, and the request gets retried multiple 
times (as expected), but I didn't see that in my prod env? Or maybe I just got 
incomplete logs from my ops team :P

> IndexFetcher not retrying after SocketTimeoutException correctly, which leads 
> to trying a full download again
> -
>
> Key: SOLR-9050
> URL: https://issues.apache.org/jira/browse/SOLR-9050
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 5.3.1
>Reporter: Timothy Potter
>Assignee: Timothy Potter
> Attachments: SOLR-9050.patch
>
>
> I'm seeing a problem where reading a large file from the leader (in SolrCloud 
> mode) during index replication leads to a SocketTimeoutException:
> {code}
> 2016-04-28 16:22:23.568 WARN  (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error 
> in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:150)
> at java.net.SocketInputStream.read(SocketInputStream.java:121)
> at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
> at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
> at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
> at 
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
> at 
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> at 
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
> at 
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at 
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
> at 
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> at 
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> {code}
> and this leads to the following error in cleanup:
> {code}
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler 
> Index fetch failed :org.apache.solr.common.SolrException: Unable to download 
> _405k.cfs completely. Downloaded 7314866176!=9990844536
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
> at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
> at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo 
> s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy 
> Error while trying to recover:org.apache.solr.common.SolrException: 
> Replication for recovery failed.
> at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
> at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> {code}
> So a simple read timeout exception leads to re-downloading the whole index 
> again, and 

[jira] [Commented] (SOLR-9050) IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again

2016-04-29 Thread Timothy Potter (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263708#comment-15263708 ]

Timothy Potter commented on SOLR-9050:
--

What's not clear to me is how we're getting into the cleanup method. I'm not 
seeing multiple retries to download the file after the first WARN is logged, 
and there are about 4 minutes between the SocketTimeoutException and the 
"Unable to download" ERROR.
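The retry behavior being discussed can be sketched roughly as follows. This is a hypothetical simplification, not Solr's actual IndexFetcher code: the {{Fetcher}} interface, {{MAX_RETRIES}} constant, and {{fetchWithRetry}} method are all invented for illustration. The point is that a SocketTimeoutException should send the loop back around to retry the same file, and only after exhausting retries should we fall through to the cleanup/error path seen in the logs above.

```java
import java.net.SocketTimeoutException;

// Hypothetical sketch (NOT Solr's real IndexFetcher): retry the same file
// after a read timeout instead of immediately falling into cleanup().
public class RetrySketch {
    static final int MAX_RETRIES = 5; // invented limit for illustration

    @FunctionalInterface
    interface Fetcher {
        void fetchPackets() throws SocketTimeoutException;
    }

    static int fetchWithRetry(Fetcher f) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                f.fetchPackets();
                return attempts; // file fully downloaded
            } catch (SocketTimeoutException e) {
                if (attempts >= MAX_RETRIES) {
                    // only now give up; cleanup() would then report the
                    // "Unable to download ... completely" error
                    throw new RuntimeException(
                        "Unable to download file completely", e);
                }
                // otherwise loop and retry from the last received packet
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // simulate two timeouts followed by success
        final int[] failures = {2};
        int attempts = fetchWithRetry(() -> {
            if (failures[0]-- > 0) {
                throw new SocketTimeoutException("Read timed out");
            }
        });
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

Under this sketch, the symptom in the logs (one WARN, then the cleanup ERROR roughly 4 minutes later with no retries in between) would correspond to the loop never re-entering after the first timeout.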
