cutting 2004/03/29 14:30:40 Modified: docs fileformats.html whoweare.html xdocs fileformats.xml Log: Updated file format documentation to note skip data. Revision Changes Path 1.22 +48 -9 jakarta-lucene/docs/fileformats.html Index: fileformats.html =================================================================== RCS file: /home/cvs/jakarta-lucene/docs/fileformats.html,v retrieving revision 1.21 retrieving revision 1.22 diff -u -r1.21 -r1.22 --- fileformats.html 29 Mar 2004 12:46:36 -0000 1.21 +++ fileformats.html 29 Mar 2004 22:30:40 -0000 1.22 @@ -1332,9 +1332,18 @@ <p> TermInfoFile (.tis)--> - TermCount, TermInfos + TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos + </p> + <p>TIVersion --> + UInt32 </p> <p>TermCount --> + UInt64 + </p> + <p>IndexInterval --> + UInt32 + </p> + <p>SkipInterval --> UInt32 </p> <p>TermInfos --> @@ -1357,6 +1366,9 @@ by the term's field name, and within that lexicographically by the term's text. </p> + <p>TIVersion names the version of the format + of this file and is -1 in Lucene 1.4. + </p> <p>Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a @@ -1389,7 +1401,7 @@ </p> <p> - This contains every 128th entry from the .tis + This contains every IndexInterval<sup>th</sup> entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file. @@ -1440,6 +1452,7 @@ </p> <p>FreqFile (.frq) --> <TermFreqs><sup>TermCount</sup> + <SkipDatum><sup>TermCount/SkipInterval</sup> </p> <p>TermFreqs --> <TermFreq><sup>DocFreq</sup> @@ -1447,7 +1460,10 @@ <p>TermFreq --> DocDelta, Freq? 
</p> - <p>DocDelta,Freq --> + <p>SkipDatum --> + DocSkip,FreqSkip,ProxSkip + </p> + <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip --> VInt </p> <p>TermFreqs @@ -1471,6 +1487,29 @@ <p> 15, 22, 3 </p> + <p>DocSkip records the document number before every + SkipInterval<sup>th</sup> document in TermFreqs. + Document numbers are represented as differences + from the previous value in the sequence. FreqSkip + and ProxSkip record the position of every + SkipInterval<sup>th</sup> entry in FreqFile and + ProxFile, respectively. File positions are + relative to the start of TermFreqs and Positions, + to the previous SkipDatum in the sequence. + </p> + <p>For example, if TermCount=35 and SkipInterval=16, + then there are two SkipData entries, containing + the 15<sup>th</sup> and 31<sup>st</sup> document + numbers in TermFreqs. The first FreqSkip names + the number of bytes after the beginning of + TermFreqs that the 16<sup>th</sup> SkipDatum + starts, and the second the number of bytes after + that that the 32<sup>nd</sup> starts. The first + ProxSkip names the number of bytes after the + beginning of Positions that the 16<sup>th</sup> + SkipDatum starts, and the second the number of + bytes after that that the 32<sup>nd</sup> starts. + </p> </blockquote> </td></tr> <tr><td><br/></td></tr> @@ -1588,8 +1627,8 @@ <p>This contains, for each document, a pointer to the document data in the Document (.tvd) file. 
</p> - <p>DocumentIndex (.tvx) --> FormatVersion<DocumentPosition><sup>NumDocs</sup></p> - <p>FormatVersion --> Int</p> + <p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition><sup>NumDocs</sup></p> + <p>TVXVersion --> Int</p> <p>DocumentPosition --> UInt64</p> <p>This is used to find the position of the Document in the .tvd file.</p> </li> @@ -1599,9 +1638,9 @@ term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.</p> <p> - Document (.tvd) --> FormatVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup> + Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup> </p> - <p>FormatVersion --> Int</p> + <p>TVDVersion --> Int</p> <p>NumFields --> VInt</p> <p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p> <p>FieldNumDelta --> VInt</p> @@ -1614,8 +1653,8 @@ <p>The Field or .tvf file.</p> <p>This file contains, for each field that has a term vector stored, a list of the terms and their frequencies.</p> - <p>Field (.tvf) --> FormatVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p> - <p>FormatVersion --> Int</p> + <p>Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p> + <p>TVFVersion --> Int</p> <p>NumTerms --> VInt</p> <p>NumDistinct --> VInt -- Future Use</p> <p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p> 1.38 +1 -1 jakarta-lucene/docs/whoweare.html Index: whoweare.html =================================================================== RCS file: /home/cvs/jakarta-lucene/docs/whoweare.html,v retrieving revision 1.37 retrieving revision 1.38 diff -u -r1.37 -r1.38 --- whoweare.html 25 Mar 2004 13:24:22 -0000 1.37 +++ whoweare.html 29 Mar 2004 22:30:40 -0000 1.38 @@ -167,7 +167,7 @@ limited contract work.</p> </li> -<li><b>Otis Gospodnetić</b> (otis at apache.org)</li> +<li><b>Otis Gospodneti?</b> (otis at apache.org)</li> <li><b>Brian Goetz</b> (briangoetz at apache.org)</li> <li><b>Scott Ganyo</b> 
(scottganyo at apache.org)</li> <li><b>Eugene Gluzberg</b> (drag0n at apache.org)</li> 1.9 +49 -9 jakarta-lucene/xdocs/fileformats.xml Index: fileformats.xml =================================================================== RCS file: /home/cvs/jakarta-lucene/xdocs/fileformats.xml,v retrieving revision 1.8 retrieving revision 1.9 diff -u -r1.8 -r1.9 --- fileformats.xml 29 Mar 2004 12:46:36 -0000 1.8 +++ fileformats.xml 29 Mar 2004 22:30:40 -0000 1.9 @@ -905,9 +905,18 @@ <p> TermInfoFile (.tis)--> - TermCount, TermInfos + TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos + </p> + <p>TIVersion --> + UInt32 </p> <p>TermCount --> + UInt64 + </p> + <p>IndexInterval --> + UInt32 + </p> + <p>SkipInterval --> UInt32 </p> <p>TermInfos --> @@ -930,6 +939,9 @@ by the term's field name, and within that lexicographically by the term's text. </p> + <p>TIVersion names the version of the format + of this file and is -1 in Lucene 1.4. + </p> <p>Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a @@ -962,7 +974,7 @@ </p> <p> - This contains every 128th entry from the .tis + This contains every IndexInterval<sup>th</sup> entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file. @@ -1005,6 +1017,7 @@ </p> <p>FreqFile (.frq) --> <TermFreqs><sup>TermCount</sup> + <SkipDatum><sup>TermCount/SkipInterval</sup> </p> <p>TermFreqs --> <TermFreq><sup>DocFreq</sup> @@ -1012,7 +1025,10 @@ <p>TermFreq --> DocDelta, Freq? </p> - <p>DocDelta,Freq --> + <p>SkipDatum --> + DocSkip,FreqSkip,ProxSkip + </p> + <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip --> VInt </p> <p>TermFreqs @@ -1036,6 +1052,30 @@ <p> 15, 22, 3 </p> + <p>DocSkip records the document number before every + SkipInterval<sup>th</sup> document in TermFreqs. 
+ Document numbers are represented as differences + from the previous value in the sequence. FreqSkip + and ProxSkip record the position of every + SkipInterval<sup>th</sup> entry in FreqFile and + ProxFile, respectively. File positions are + relative to the start of TermFreqs and Positions, + to the previous SkipDatum in the sequence. + </p> + <p>For example, if TermCount=35 and SkipInterval=16, + then there are two SkipData entries, containing + the 15<sup>th</sup> and 31<sup>st</sup> document + numbers in TermFreqs. The first FreqSkip names + the number of bytes after the beginning of + TermFreqs that the 16<sup>th</sup> SkipDatum + starts, and the second the number of bytes after + that that the 32<sup>nd</sup> starts. The first + ProxSkip names the number of bytes after the + beginning of Positions that the 16<sup>th</sup> + SkipDatum starts, and the second the number of + bytes after that that the 32<sup>nd</sup> starts. + </p> + </subsection> <subsection name="Positions"> @@ -1127,8 +1167,8 @@ <p>This contains, for each document, a pointer to the document data in the Document (.tvd) file. 
</p> - <p>DocumentIndex (.tvx) --> FormatVersion<DocumentPosition><sup>NumDocs</sup></p> - <p>FormatVersion --> Int</p> + <p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition><sup>NumDocs</sup></p> + <p>TVXVersion --> Int</p> <p>DocumentPosition --> UInt64</p> <p>This is used to find the position of the Document in the .tvd file.</p> </li> @@ -1138,9 +1178,9 @@ term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.</p> <p> - Document (.tvd) --> FormatVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup> + Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup> </p> - <p>FormatVersion --> Int</p> + <p>TVDVersion --> Int</p> <p>NumFields --> VInt</p> <p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p> <p>FieldNumDelta --> VInt</p> @@ -1153,8 +1193,8 @@ <p>The Field or .tvf file.</p> <p>This file contains, for each field that has a term vector stored, a list of the terms and their frequencies.</p> - <p>Field (.tvf) --> FormatVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p> - <p>FormatVersion --> Int</p> + <p>Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p> + <p>TVFVersion --> Int</p> <p>NumTerms --> VInt</p> <p>NumDistinct --> VInt -- Future Use</p> <p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
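A note on the skip data added above: the following is a minimal Python sketch of how a reader might decode the SkipDatum entries, not the actual Lucene implementation. It assumes the usual Lucene VInt encoding (seven data bits per byte, low-order group first, high bit set when more bytes follow), and it assumes that DocSkip, FreqSkip and ProxSkip are each stored as a difference from the previous SkipDatum, as the new paragraphs describe. The function names (read_vint, write_vint, read_skip_data) and the sample numbers are made up for illustration.

```python
import io


def read_vint(stream):
    """Read one VInt: 7 data bits per byte, low-order group first;
    a set high bit means another byte follows (assumed encoding)."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a VInt")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result
        shift += 7


def write_vint(value):
    """Encode a non-negative integer as a VInt (helper for the demo only)."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_skip_data(stream, entry_count, skip_interval):
    """Decode entry_count // skip_interval SkipDatum triples and turn the
    stored deltas into absolute (doc, freq_pointer, prox_pointer) values.
    Pointers are relative to the start of TermFreqs and Positions."""
    entries = []
    doc = freq_ptr = prox_ptr = 0
    for _ in range(entry_count // skip_interval):
        doc += read_vint(stream)       # DocSkip: delta from previous skip doc
        freq_ptr += read_vint(stream)  # FreqSkip: delta from previous pointer
        prox_ptr += read_vint(stream)  # ProxSkip: delta from previous pointer
        entries.append((doc, freq_ptr, prox_ptr))
    return entries


# Hypothetical values: 35 entries with SkipInterval=16 give two SkipDatum triples.
sample = b"".join(write_vint(v) for v in [40, 20, 300, 55, 18, 260])
skips = read_skip_data(io.BytesIO(sample), 35, 16)
print(skips)
```

With 35 entries and SkipInterval=16 the loop runs twice, matching the "two SkipData entries" in the example text of the diff. A reader that wants to jump to a target document would scan these absolute values for the last entry whose document number is below the target and then seek to the matching pointers.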