[jira] Commented: (LUCENE-2482) Index sorter

Robert Muir (JIRA) Sun, 16 Jan 2011 14:55:09 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982411#action_12982411
 ]


Robert Muir commented on LUCENE-2482:
-------------------------------------

bq. I'm not sure if I follow your use case though ... please remember that this 
re-sorting is applied exactly the same to all postings, so savings on one list 
may cause bloat on another list.

Hi Andrzej, I came across the book chapter linked below the other day, and 
thought it would be really interesting in the context of some of our newer 
codecs under development in trunk and the bulkpostings branch.

I found the results presented there for index sorting with codecs like simple9 
really compelling: a significant reduction in bits/posting, especially for 
docids, because such codecs can pack a lot of small deltas efficiently.

{noformat}
The first method reorders the documents in a text collection based on the number of
distinct terms contained in each document. The idea is that two documents that each
contain a large number of distinct terms are more likely to share terms than are a
document with many distinct terms and a document with few distinct terms. Therefore,
by assigning docids so that documents with many terms are close together, we may
expect a greater clustering effect than by assigning docids at random.

The second method assumes that the documents have been crawled from the Web (or
maybe a corporate Intranet). It reassigns docids in lexicographical order of URL. The
idea here is that two documents from the same Web server (or maybe even from the
same directory on that server) are more likely to share common terms than two random
documents from unrelated locations on the Internet.
{noformat}

http://www.ir.uwaterloo.ca/book/06-index-compression.pdf (see page 214: docid reordering)
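
To make the effect concrete, here is a minimal sketch of the second method (docids reassigned in URL order) together with a crude bits-per-gap count. It is only an illustration under stated assumptions: the class and method names are hypothetical and not Lucene APIs, the URLs and posting lists are made up, and the plain binary width of each gap stands in for what a real codec such as simple9 would spend.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names are Lucene APIs.
final class DocIdReorderSketch {

    // Bits needed to write a positive gap in plain binary: a crude stand-in
    // for the cost of that gap under a bit-packing codec.
    static int bitsFor(int gap) {
        return 32 - Integer.numberOfLeadingZeros(gap);
    }

    // Builds oldToNew[] so that new docids follow lexicographic URL order.
    // Arrays.sort on objects is stable, so equal URLs keep their old order.
    static int[] docIdsByUrl(String[] urls) {
        Integer[] order = new Integer[urls.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> urls[a].compareTo(urls[b]));
        int[] oldToNew = new int[urls.length];
        for (int newId = 0; newId < order.length; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    // Total bits for one posting list after remapping its docids.
    // Gaps are taken from a virtual docid of -1 so every gap is >= 1.
    static int gapBits(int[] postings, int[] oldToNew) {
        int[] mapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            mapped[i] = oldToNew[postings[i]];
        }
        Arrays.sort(mapped);
        int bits = 0;
        int prev = -1;
        for (int id : mapped) {
            bits += bitsFor(id - prev);
            prev = id;
        }
        return bits;
    }

    public static void main(String[] args) {
        // Made-up crawl order that interleaves two hosts.
        String[] urls = {
            "http://a.example/0", "http://b.example/0",
            "http://a.example/1", "http://b.example/1",
            "http://a.example/2", "http://b.example/2",
            "http://a.example/3", "http://b.example/3"
        };
        int[] identity = {0, 1, 2, 3, 4, 5, 6, 7};
        int[] byUrl = docIdsByUrl(urls);

        int[] onHostA = {0, 2, 4, 6}; // a term found only on host a
        int[] onHostB = {1, 3, 5, 7}; // a term found only on host b
        int[] onBoth = {0, 1};        // a term found on one page of each host

        System.out.println("host a: " + gapBits(onHostA, identity) + " -> " + gapBits(onHostA, byUrl));
        System.out.println("host b: " + gapBits(onHostB, identity) + " -> " + gapBits(onHostB, byUrl));
        System.out.println("both:   " + gapBits(onBoth, identity) + " -> " + gapBits(onBoth, byUrl));
    }
}
```

With these toy inputs the two host-specific lists shrink from 15 to 10 bits in total, while the list that spans both hosts grows from 2 to 4 bits. That second number is the bloat on other lists that Andrzej points out, since the same permutation is applied to every posting list. The first method from the quote could reuse the same oldToNew construction with a comparator on distinct-term counts instead of URLs.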


> Index sorter
> ------------
>
>                 Key: LUCENE-2482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2482
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1, 4.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 3.1, 4.0
>
>         Attachments: indexSorter.patch
>
>
> A tool to sort an index according to a float document weight. Documents with 
> high weight are given low document numbers, which means that they will be 
> evaluated first. When using a strategy of "early termination" of queries (see 
> TimeLimitedCollector), such sorting significantly improves the quality of 
> partial results.
> (Originally this tool was created by Doug Cutting in Nutch, and used norms as 
> document weights, so the ordering was limited by the limited resolution of 
> norms. This is a pure Lucene version of the tool, and it uses arbitrary 
> floats from a specified stored field.)
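
As a rough sketch of the ordering the description implies (not the code in indexSorter.patch, which is not reproduced here), the old-to-new docid map can be derived by sorting document numbers by descending weight. The class and method names are hypothetical, and the weights are assumed to have already been read from the stored field into an array indexed by old docid.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not the attached patch.
final class WeightSortSketch {

    // Returns oldToNew[oldDoc] = newDoc, where the highest weight gets docid 0.
    // Arrays.sort on objects is stable, so equal weights keep their old order.
    static int[] oldToNewByWeight(float[] weights) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Descending weight: compare b against a.
        Arrays.sort(order, (a, b) -> Float.compare(weights[b], weights[a]));
        int[] oldToNew = new int[weights.length];
        for (int newId = 0; newId < order.length; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        float[] weights = {0.1f, 0.9f, 0.5f};
        System.out.println(Arrays.toString(oldToNewByWeight(weights)));
    }
}
```

A collector that terminates early then sees the highest-weight documents first, because postings are traversed in increasing docid order.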


