I'm using Solr compiled from a branch_4x checkout.

solr-impl    4.1-SNAPSHOT 1416639M - ncindex - 2012-12-03 12:54:38

I've noticed something really odd happening during a DIH full-import of millions of documents, and I'm wondering if it's a bug. The config bits that I think may be relevant are below. If you'd like more information, please let me know what you need and whether I should turn on settings like infoStream and do another import:

Autocommit is set to maxDocs 65536 and maxTime 300000 (5 minutes).
ramBufferSizeMB is 100.
updateLog is enabled, no options.
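
As a rough sketch, those three settings would look something like this in solrconfig.xml. This is reconstructed from the values listed above rather than pasted from the actual file, and the element placement assumes the layout of the stock 4.x example config; anything not mentioned above (such as openSearcher) is deliberately left out:

```xml
<!-- Sketch only: values come from the list above, placement is assumed -->
<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>65536</maxDocs>
    <maxTime>300000</maxTime> <!-- milliseconds, i.e. 5 minutes -->
  </autoCommit>
  <!-- enabled with no options -->
  <updateLog/>
</updateHandler>
```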

What's happening is that whenever it hits maxDocs, I get 2 segments, one of them significantly smaller than the other. Rarely, it creates 3 segments! I know it's not a ramBuffer problem, because initially the exact same thing was happening with maxDocs at 100000 and a 32MB ramBuffer; raising the ramBuffer and lowering maxDocs made no difference. It takes significantly less than 5 minutes for maxDocs documents to get indexed, so the maxTime value should not be a factor.

Sometimes the last segment stays incomplete until the next autocommit, consisting only of files like the following. The rest of its files show up on that next autocommit.

-rw-r--r-- 1 ncindex ncindex       411 Dec  3 14:22 _fu.si
-rw-r--r-- 1 ncindex ncindex     55966 Dec  3 14:22 _fu_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex   1983125 Dec  3 14:22 _fu_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   1720492 Dec  3 14:22 _fu_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex   1384931 Dec  3 14:22 _fu_Lucene41_0.doc

Sometimes the last segment does get written completely before the next autocommit. I have no idea what causes the difference:

-rw-r--r-- 1 ncindex ncindex    144497 Dec  3 14:16 _fq.tvx
-rw-r--r-- 1 ncindex ncindex   6106209 Dec  3 14:16 _fq.tvf
-rw-r--r-- 1 ncindex ncindex     18090 Dec  3 14:16 _fq.tvd
-rw-r--r-- 1 ncindex ncindex       411 Dec  3 14:16 _fq.si
-rw-r--r-- 1 ncindex ncindex     67683 Dec  3 14:16 _fq_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex   2431846 Dec  3 14:16 _fq_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   2412246 Dec  3 14:16 _fq_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex   1834286 Dec  3 14:16 _fq_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex      1152 Dec  3 14:16 _fq.fdx
-rw-r--r-- 1 ncindex ncindex   2518453 Dec  3 14:16 _fq.fdt

The segments alternate between large and small, with each large one at least ten times the size of the small ones. It writes the large segment first. Both of the segment listings above are from small segments; here's an example of a large one:

-rw-r--r-- 1 ncindex ncindex 11289877 Dec  3 14:21 _ft.fdt
-rw-r--r-- 1 ncindex ncindex     7757 Dec  3 14:21 _ft.fdx
-rw-r--r-- 1 ncindex ncindex     3114 Dec  3 14:21 _ft.fnm
-rw-r--r-- 1 ncindex ncindex  8304619 Dec  3 14:21 _ft_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex  9054058 Dec  3 14:21 _ft_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex  9666900 Dec  3 14:21 _ft_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   244322 Dec  3 14:21 _ft_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex      115 Dec  3 14:21 _ft_nrm.cfe
-rw-r--r-- 1 ncindex ncindex   170365 Dec  3 14:21 _ft_nrm.cfs
-rw-r--r-- 1 ncindex ncindex      411 Dec  3 14:21 _ft.si
-rw-r--r-- 1 ncindex ncindex   113554 Dec  3 14:21 _ft.tvd
-rw-r--r-- 1 ncindex ncindex 23374630 Dec  3 14:21 _ft.tvf
-rw-r--r-- 1 ncindex ncindex   908209 Dec  3 14:21 _ft.tvx

Thanks,
Shawn

