Honestly, with this sort of scale you should be thinking about support
(disclaimer: I work for Elasticsearch support).

However, let's see what we can do. A few questions first:
What version of ES and Java are you running?
What are you using to monitor your cluster?
How many GB is that index?
Is it all in one massive index?
How many GB is your data in total?
Why do you have 2 replicas?
Are you searching while indexing, or just indexing the data? If it's the
latter, you might want to try disabling replicas and setting the index's
refresh interval to -1, inserting your data, and then turning refresh back
on and restoring the replica count so the data can re-replicate. That's best
practice for large amounts of indexing; a sketch of the settings calls
follows.
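
For illustration, a minimal sketch of those two settings changes, assuming
Python with the requests library, a node reachable at localhost:9200, and
the index name agora_v1 taken from your logs (adjust all three for your
cluster):

import requests

ES = "http://localhost:9200"  # hypothetical endpoint; any data node will do
INDEX = "agora_v1"            # index name taken from your error logs

# Before the bulk load: drop replicas and disable periodic refresh.
requests.put(ES + "/" + INDEX + "/_settings", json={
    "index": {"number_of_replicas": 0, "refresh_interval": "-1"},
}).raise_for_status()

# ... run your bulk inserts here ...

# After the load: re-enable refresh and let the replicas rebuild.
requests.put(ES + "/" + INDEX + "/_settings", json={
    "index": {"number_of_replicas": 2, "refresh_interval": "1s"},
}).raise_for_status()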

Also consider dropping your bulk size down to 5K; that's generally
considered the upper limit for bulk API batches. A batching sketch is below.
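
For illustration, a minimal batching sketch against the raw _bulk endpoint
(same hypothetical endpoint and index as above; NEST exposes an equivalent
batch-size knob, and the _type value "doc" is a placeholder for whatever
your mapping actually uses):

import json
import requests

ES = "http://localhost:9200"  # hypothetical endpoint
INDEX = "agora_v1"
BATCH = 5000                  # stay at or below ~5K docs per bulk request

def bulk_index(docs):
    # Send docs to the bulk API in batches of BATCH documents each.
    for start in range(0, len(docs), BATCH):
        lines = []
        for doc in docs[start:start + BATCH]:
            # One action line plus one source line per document.
            lines.append(json.dumps({"index": {"_index": INDEX, "_type": "doc"}}))
            lines.append(json.dumps(doc))
        body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline
        resp = requests.post(ES + "/_bulk", data=body,
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()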

On 9 January 2015 at 14:44, Darshat Shah <[email protected]> wrote:

> Hi,
> We have a 98-node ES cluster, each node with 32GB RAM, 16GB of which is
> reserved for ES via the config file. The index has 98 shards with 2 replicas.
>
> On this cluster we are loading a large number of documents (about 10
> billion when done). About 40 million documents are generated per hour, and
> we are pre-loading several days' worth of documents to prototype how ES
> will scale and what its query performance looks like.
>
> Right now we are facing problems getting the data pre-loaded. Indexing is
> turned off. We use the NEST client with a batch size of 10k. To speed up the
> load, we distribute the hourly data across each of the 98 nodes to insert in
> parallel. This worked OK for a few hours, until we had 4.5B documents in the
> cluster.
>
> After that, the cluster state went red. The cat pending tasks API shows
> errors like the ones below. CPU/disk/memory all look fine on the nodes.
>
> Why are we getting these errors, and is there a best practice we should
> follow? Any help would be greatly appreciated, since this blocks
> prototyping ES for our use case.
>
> Thanks,
> Darshat
>
> Sample errors:
>
> source : shard-failed ([agora_v1][24], node[00ihc1ToRiqMDJ1lou1Sig], [R], s[INITIALIZING]),
>   reason [Failed to start shard, message [RecoveryFailedException[[agora_v1][24]: Recovery failed from [Shingen Harada][RDAwqX9yRgud9f7YtZAJPg][CH1SCH060051438][inet[/10.46.153.84:9300]] into [Elfqueen][00ihc1ToRiqMDJ1lou1Sig][CH1SCH050053435][inet[/10.46.182.106:9300]]];
>   nested: RemoteTransportException[[Shingen Harada][inet[/10.46.153.84:9300]][internal:index/shard/recovery/start_recovery]];
>   nested: RecoveryEngineException[[agora_v1][24] Phase[1] Execution failed];
>   nested: RecoverFilesRecoveryException[[agora_v1][24] Failed to transfer [0] files with total size of [0b]];
>   nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6r]; ]]
>
>
> AND
>
> source : shard-failed ([agora_v1][95], node[PUsHFCStRaecPA6MuvJV9g], [P], s[INITIALIZING]),
>   reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[agora_v1][95] failed to fetch index version after copying it over];
>   nested: CorruptIndexException[[agora_v1][95] Preexisting corrupted index [corrupted_1wegvS7BSKSbOYQkX9zJSw] caused by: CorruptIndexException[Read past EOF while reading segment infos]
>     EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\95\index\segments_11j")]
>   org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
>     at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
>     at org.elasticsearch.index.store.Store.access$400(Store.java:80)
>     at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
> ---snip more stack trace-----
>
