[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558418#comment-13558418
 ] 

Adrien Grand commented on LUCENE-4599:
--------------------------------------

I measured the size of the term vector files relative to Lucene40TVF for small 
docs (the Wikipedia 1K docs) as a function of the chunk size (the patch uses 
2^14 as its default chunk size). Lower ratios are better:
|| Chunk size || no options || positions + offsets ||
| 2^7 | 0.79 | 0.68 |
| 2^8 | 0.79 | 0.68 |
| 2^9 | 0.75 | 0.66 |
| 2^10 | 0.73 | 0.65 |
| 2^11 | 0.70 | 0.63 |
| 2^12 | 0.68 | 0.62 |
| 2^13 | 0.65 | 0.60 |
| 2^14 | 0.63 | 0.59 |
| 2^15| 0.62 | 0.58 |
| 2^16| 0.62 | 0.59 |
| 2^17| 0.62 | 0.58 |

Interestingly, raising the chunk size above 2^14 brings almost no further gain. 
2^11 or 2^12 look like good candidates for the default size if we were to make 
this TVF the default one: big documents would likely end up alone in their 
chunks, while small docs would still be grouped together, which keeps them from 
raising the compression ratio.
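
To illustrate why the ratio drops as chunks grow and then flattens out, here is a 
minimal sketch, not the patch's code. It assumes documents are buffered until 
the buffer reaches the chunk size and are then compressed as one unit. It uses 
zlib as a stand-in compressor and synthetic documents, and it measures compressed 
size against raw size rather than against Lucene40TVF, so the absolute numbers 
are not comparable to the table above. The names make_docs and chunked_ratio are 
hypothetical.

```python
import random
import zlib


def make_docs(count, words_per_doc, seed=0):
    """Build small synthetic documents from a shared 50-word vocabulary."""
    rng = random.Random(seed)
    vocab = [f"term{i}" for i in range(50)]
    return [
        " ".join(rng.choice(vocab) for _ in range(words_per_doc)).encode("ascii")
        for _ in range(count)
    ]


def chunked_ratio(docs, chunk_size):
    """Return compressed size / raw size when docs are compressed in chunks.

    Assumption: documents are appended to a buffer, and the buffer is
    compressed as one unit once it holds at least chunk_size bytes.
    """
    raw_total = 0
    compressed_total = 0
    buffer = bytearray()
    for doc in docs:
        buffer += doc
        if len(buffer) >= chunk_size:
            raw_total += len(buffer)
            compressed_total += len(zlib.compress(bytes(buffer)))
            buffer.clear()
    if buffer:  # flush the last, possibly short, chunk
        raw_total += len(buffer)
        compressed_total += len(zlib.compress(bytes(buffer)))
    return compressed_total / raw_total


docs = make_docs(count=2000, words_per_doc=30)
ratios = {size: chunked_ratio(docs, size) for size in (2**7, 2**11, 2**14)}
for size, ratio in ratios.items():
    print(f"chunk size {size:>6}: ratio {ratio:.2f}")
```

With small documents, a small chunk size leaves each document nearly alone in 
its chunk, so the compressor sees little redundancy; larger chunks let it reuse 
terms shared across documents.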


                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.2
>
>         Attachments: 4599-dataimport-fail.log, 4599-zookeer-fail.log, 
> CompressingTVF_ingest_rate.png, highlightNoStop.tasks, 
> Lucene40TVF_ingest_rate.png, LUCENE-4599.patch, LUCENE-4599.patch, 
> LUCENE-4599.patch, solr.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.
