Great answer
Thanks Michael.

Yes, the difference was large: more than 1 GB.
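
For anyone reproducing the comparison, the "index size" discussed in this thread is the total size of the files in the index directory. A minimal sketch of that measurement, using only the JDK, might look like the following. The class name IndexDirSize and the method totalBytes are made up for illustration and are not part of Lucene; the sketch assumes the index directory is flat (no subdirectories) and that no writer is modifying it while it is measured.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not a Lucene class): sums file sizes in a directory.
class IndexDirSize {

    /** Returns the total size, in bytes, of all regular files directly inside dir. */
    static long totalBytes(Path dir) throws IOException {
        // Files.list is not recursive; this assumes a flat index directory.
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // Lambdas cannot throw checked exceptions, so wrap it.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java IndexDirSize <indexDirA> <indexDirB>");
            return;
        }
        long a = totalBytes(Paths.get(args[0]));
        long b = totalBytes(Paths.get(args[1]));
        System.out.printf("%s: %d bytes%n", args[0], a);
        System.out.printf("%s: %d bytes%n", args[1], b);
        System.out.printf("difference: %d bytes%n", Math.abs(a - b));
    }
}
```

As the reply quoted below points out, a number measured this way includes space still held by deleted documents and unmerged segments, so it is only comparable across two indexes after both have been merged down the same way.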
Best regards

> On Nov 13, 2020, at 1:49 PM, Michael Sokolov <msoko...@gmail.com> wrote:
> 
> You can't directly compare disk usage across two indexes, even with
> the same data. Try re-indexing one of your datasets, and you will see
> that the disk size is not the same. Mostly this is because the way
> segments are merged varies with some randomness from one run to
> another. Although the size of the difference you report is pretty
> large, it is not out of the question that it could occur, especially if
> you have a large number of deletions or updates to existing documents.
> If you want to get a more accurate idea of the amount of space taken
> up by your index, you could try calling IndexWriter.forceMerge(1);
> this will merge your index to a single segment, eliminating waste. It
> is not generally recommended to do this for indexes you use for
> querying, but it can be a useful tool for analysis.
> 
>> On Fri, Nov 13, 2020 at 1:01 PM <baris.ka...@oracle.com> wrote:
>> 
>> Nothing changed between the two index generations except that the data
>> changed a bit, as I described.
>> 
>> The size I am reporting is the size of the directory where all index
>> files are stored, measured once Lucene has finished generating the index.
>> 
>> I don't know about deleted docs. How do you trace that? Yes, the queries
>> run exactly the same way (same number of results). Most of the time only
>> the order changes, which is fine, or a few different entries show up,
>> and I don't know why, since the lowercase filter should normalize the
>> terms even if the casing of the original data changes.
>> 
>> Yes, I am absolutely sure nothing else changed. I kept all those things
>> the same across the two runs.
>> 
>> Actually, does the Lucene repository have these kinds of experiments
>> across versions (major or minor)?
>> 
>> If I were Lucene, I would run these experiments to see the impact on the
>> resulting index. This would help uncover potential unidentified bugs.
>> 
>> Methodology:
>> 
>> Have a large dataset, e.g. 15 million docs.
>> 
>> Run the indexer with very common settings each time a new version comes out.
>> 
>> 
>> I am not using Solr, just pure Lucene 7.7.2. This information was in my
>> other email; let me copy and paste it here:
>> 
>> 
>> 
>> ===== previous email ====
>> 
>> On a related issue:
>> 
>> With version 7.7.2 I experienced this:
>> 
>> data is all lower case (same number of docs as the next case)
>> 
>> vs
>> 
>> data is camel case, except the last word, which is always in capital letters
>> 
>> 
>> In both cases I used the lowercase filter in the indexer, so indexing is
>> done in all lower case. The index size for the first case is about
>> 9.5GB,
>> 
>> but the index size for the same data in the second case was 11GB.
>> 
>> 
>> What causes such a difference and increase in index size? The number of
>> docs is the same in both cases.
>> 
>> 
>> Best regards
>> 
>> 
>> 
>>> On 11/13/20 7:39 AM, Erick Erickson wrote:
>>> What does “final finished sizes” mean? After optimize or just after
>>> finishing all indexing? The former is what counts here.
>>> 
>>> And you provided no information on the number of deleted docs in the two
>>> cases. Is the number of deletedDocs the same (or close)? And does the
>>> q=*:* query return the same numFound?
>>> 
>>> Finally, are you absolutely and totally sure that no other options
>>> changed? For instance, you specified docValues=true for some field in
>>> one but not the other, or stored=true, etc. If you’re using the same
>>> schema.
>>> 
>>> And you also haven’t provided information on what versions of Solr
>>> you’re talking about. You mention 7.7.2, but not the _other_ version of
>>> Solr. If you’re going from one major version to another, sometimes
>>> defaults change, for docValues on primitive fields especially. I’d
>>> consider firing up Luke and examining the field definitions in detail.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:
>>>> 
>>>> Hi,-
>>>> Thanks.
>>>> These are final finished sizes in both cases.
>>>> Best regards
>>>> 
>>>> 
>>>>> On Nov 12, 2020, at 11:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>> 
>>>>> Yes, that issue is fixed. The “Resolution” tag is the key: it’s marked
>>>>> “fixed” and the version is 8.0.
>>>>> 
>>>>> As for your other question, index size is a very imprecise number. How
>>>>> many deleted documents are there in each case? Deleted documents take up
>>>>> disk space until the segments containing them are merged away.
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>>> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/LUCENE-8448
>>>>>> 
>>>>>> 
>>>>>> Hi,-
>>>>>> 
>>>>>> Is this issue fixed, please? Could you please help me figure it out?
>>>>>> 
>>>>>> Best regards
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>> 