Great answer, thanks Michael. Yes, the difference was too large: more than 1 GB.

Best regards

> On Nov 13, 2020, at 1:49 PM, Michael Sokolov <msoko...@gmail.com> wrote:
>
> You can't directly compare disk usage across two indexes, even with the same data. Try re-indexing one of your datasets, and you will see that the disk size is not the same. Mostly this is because the way segments are merged varies with some randomness from one run to another. The size of the difference you report is pretty large, but it is not out of the question that it could occur, especially if you have a large number of deletions or updates to existing documents. If you want to get a more accurate idea of the amount of space taken up by your index, you could try calling IndexWriter.forceMerge(1); this will merge your index to a single segment, eliminating waste. It is not generally recommended to do this for indexes you use for querying, but it can be a useful tool for analysis.
>
>> On Fri, Nov 13, 2020 at 1:01 PM <baris.ka...@oracle.com> wrote:
>>
>> Nothing changed between the two index generations except that the data changed a bit, as I described.
>>
>> When Lucene is done generating the index, what I am reporting is the size of the directory where all the index files are stored.
>>
>> I don't know about deleted docs. How do you trace that? Yes, the queries run exactly the same way (same number of results). Most of the time only the order changes, which is fine; or a few different entries show up, and I don't know why, since the lowercase filter should normalize the text even if the casing of the original data changes.
>>
>> Yes, absolutely sure nothing else changed. I kept all those things the same across the two runs.
>>
>> Actually, does the Lucene repository have these kinds of experiments across versions (major or minor)?
>>
>> If I were Lucene, I would run these experiments to see the impact on the resulting index. This would help find potential unidentified bugs.
>>
>> Methodology:
>>
>> Have a large dataset, like 15 million docs.
>>
>> Run the indexer with very common settings each time a new version comes out.
>>
>> I am not using Solr, just pure Lucene 7.7.2. This info was in the other email here. Let me copy and paste it:
>>
>> ===== previous email ====
>>
>> On a related issue:
>>
>> With version 7.7.2 I experienced this:
>>
>> data is all lower case (same number of docs as in the next case, though)
>>
>> vs
>>
>> data is camel case, except the last word, which is always in capital letters
>>
>> In both cases I used the lowercase filter in the indexer, so indexing is done with all lower case. The index size for the first case is about 9.5GB, but for the second case, with the same amount of data, it was 11GB.
>>
>> What causes such a difference and increase in index size? The number of docs is the same in both cases.
>>
>> Best regards
>>
>>> On 11/13/20 7:39 AM, Erick Erickson wrote:
>>> What does “final finished sizes” mean? After optimize or just after finishing all indexing? The former is what counts here.
>>>
>>> And you provided no information on the number of deleted docs in the two cases. Is the number of deletedDocs the same (or close)? And does the q=*:* query return the same numFound?
>>>
>>> Finally, are you absolutely and totally sure that no other options changed? For instance, you specified docValues=true for some field in one but not the other. Or stored=true etc. If you’re using the same schema.
>>>
>>> And you also haven’t provided information on what versions of Solr you’re talking about. You mention 7.7.2, but not the _other_ version of Solr. If you’re going from one major version to another, sometimes defaults change, especially for docValues on primitive fields. I’d consider firing up Luke and examining the field definitions in detail.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:
>>>>
>>>> Hi,-
>>>> Thanks.
>>>> These are final finished sizes in both cases.
>>>> Best regards
>>>>
>>>>> On Nov 12, 2020, at 11:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>
>>>>> Yes, that issue is fixed. The “Resolution” tag is the key; it’s marked “fixed” and the version is 8.0.
>>>>>
>>>>> As for your other question, index size is a very imprecise number. How many deleted documents are there in each case? Deleted documents take up disk space until the segments containing them are merged away.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>>> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/LUCENE-8448
>>>>>>
>>>>>> Hi,-
>>>>>>
>>>>>> Is this issue fixed, please? Could you please help me figure it out?
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
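
Since the thread compares the total size of two index directories, one way to look closer is to break that size down by file extension and compare the two indexes side by side. Below is a minimal sketch in plain Java with no Lucene dependency. The class and method names (IndexDirSize, bytesByExtension, extensionOf) are made up for illustration and are not part of Lucene. It assumes the index lives in a single flat directory. The reading of the output is also an assumption to check against the file-format documentation for your Lucene version: in the 7.x formats, to my knowledge, .fdt holds stored field data, .tim the term dictionary, .dvd doc values data, and .liv files are written only for segments that contain deleted documents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class IndexDirSize {

    /** Returns the part of the file name after the last dot, or "(none)" if there is no dot. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "(none)" : fileName.substring(dot + 1);
    }

    /** Sums the sizes in bytes of the regular files directly inside dir, grouped by extension. */
    static Map<String, Long> bytesByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(p)) {
                    String ext = extensionOf(p.getFileName().toString());
                    totals.merge(ext, Files.size(p), Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java IndexDirSize /path/to/index
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        long sum = 0;
        for (Map.Entry<String, Long> e : bytesByExtension(dir).entrySet()) {
            System.out.printf("%-8s %,15d bytes%n", e.getKey(), e.getValue());
            sum += e.getValue();
        }
        System.out.printf("%-8s %,15d bytes%n", "total", sum);
    }
}
```

Running this against both index directories and comparing the two tables should show which file types account for the difference. If segments were written as compound files (.cfs), the per-component breakdown is hidden inside them, so this is only a rough first look; the number of deleted documents itself has to be read through Lucene (or Luke), as suggested in the thread.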