[jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields

Enis Soztutar (JIRA) Fri, 09 Mar 2007 02:28:47 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479552 ]


Enis Soztutar commented on LUCENE-252:
--------------------------------------

I also agree with tokenized field caching, which is a use case for Nutch. Let 
me elaborate on the use case. In a Nutch deployment, we generate indexes from 
web documents, and the set of fields is known a priori. The indexes are then 
distributed to several index servers that communicate over Hadoop's RPC. Each 
query is sent to all of the index servers, and the results are collected and 
merged on the fly. Since the indexes need not be disjoint (crawling is an 
adaptive process), the results must be merged without returning any document 
more than once, so we need a unique key to represent each document. 

The default Nutch codebase uses the site field (the URL's hostname), which is 
untokenized, for this task, and allows only 1 - 2 documents from a site in the 
search results. For performance reasons, the site field is cached in the index 
servers with FieldCache.getStrings(). The problem arises when we want to show 
more than one result from a specific site (for example, in a site:apache.org 
query) and the same URL is indexed on more than one index server. If we then 
use the tokenized url field in the FieldCache, deleting duplicates becomes 
error prone. Since we use FieldCache.getStrings() rather than 
FieldCache.getStringIndex(), the problem here is not tokenized field sorting 
but tokenized fields not being cached correctly: getStrings() returns an array 
like [com, edu, www, youtube], in which each document has only a single token 
rather than the whole URL. 
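
To make the getStrings() behaviour concrete, here is a minimal sketch that 
does not use Lucene. It assumes the cache is filled by walking the field's 
terms in sorted order and writing each term into the slot of every document 
that contains it, so later terms overwrite earlier ones. The class 
TokenizedCacheDemo and the method fillCache are hypothetical names used only 
for illustration. 

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokenizedCacheDemo {

    // Simulates a term-driven string cache: one slot per document,
    // filled by iterating over all terms of the field in sorted order.
    public static String[] fillCache(List<List<String>> docTokens) {
        // Build a tiny inverted index: term -> ids of documents containing it.
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docTokens.size(); doc++) {
            for (String token : docTokens.get(doc)) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(doc);
            }
        }
        String[] cache = new String[docTokens.size()];
        // Each term overwrites the slot of every document it occurs in,
        // so a document ends up with its lexicographically last token.
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                cache[doc] = entry.getKey();
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // Assumed tokenization of two URLs into lowercase word tokens.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("http", "www", "youtube", "com"),
            Arrays.asList("http", "www", "apache", "org"));
        System.out.println(Arrays.toString(fillCache(docs))); // prints [youtube, www]
    }
}
```

Under these assumptions neither slot holds a usable unique key: two different 
URLs under www would both be cached as "www" and would be treated as 
duplicates of each other. 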

Well, if you are still with me, here is my proposal: 

1. In FieldCacheImpl.java, add the following check to both the getStrings and 
getStringIndex functions:

      Field docField = getField(reader, field);
      if (docField != null && docField.isStored() && docField.isTokenized()) {
          throw new RuntimeException("Caching in Tokenized Fields is not allowed");
      }

2. Subclass FieldCacheImpl as StoredFieldCacheImpl and implement stored field 
caching there, delegating untokenized fields to the superclass.
3. Add the implementation to FieldCache.java:

 public static FieldCache DEFAULT = new FieldCacheImpl();
 public static FieldCache STORED_CACHE = new StoredFieldCacheImpl();

This way Lucene internals will not be affected, and stored field caching can 
still be performed. 
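
The three steps can be sketched roughly as follows. This is not the attached 
patch and does not use Lucene classes: FieldInfo, TermCache and StoredCache 
are hypothetical stand-ins for the field metadata, FieldCacheImpl and 
StoredFieldCacheImpl respectively, and only the delegation pattern is meant 
to carry over. 

```java
public class StoredCacheSketch {

    /** Hypothetical stand-in for a field's index data; not a Lucene class. */
    public static class FieldInfo {
        final boolean stored;
        final boolean tokenized;
        final String[] storedValues; // full stored value per document
        final String[] termValues;   // what a term-based cache would return

        public FieldInfo(boolean stored, boolean tokenized,
                         String[] storedValues, String[] termValues) {
            this.stored = stored;
            this.tokenized = tokenized;
            this.storedValues = storedValues;
            this.termValues = termValues;
        }
    }

    /** Plays the role of FieldCacheImpl with the check from step 1. */
    public static class TermCache {
        public String[] getStrings(FieldInfo f) {
            if (f.stored && f.tokenized) {
                throw new RuntimeException("Caching in Tokenized Fields is not allowed");
            }
            return f.termValues.clone();
        }
    }

    /** Plays the role of StoredFieldCacheImpl from step 2. */
    public static class StoredCache extends TermCache {
        @Override
        public String[] getStrings(FieldInfo f) {
            if (f.stored && f.tokenized) {
                // Tokenized and stored: cache the whole stored value.
                return f.storedValues.clone();
            }
            // Everything else is delegated to the term-based superclass.
            return super.getStrings(f);
        }
    }

    // Step 3: two shared instances, mirroring DEFAULT and STORED_CACHE.
    public static final TermCache DEFAULT = new TermCache();
    public static final TermCache STORED_CACHE = new StoredCache();

    public static void main(String[] args) {
        FieldInfo url = new FieldInfo(true, true,
            new String[] {"http://www.apache.org/"}, new String[] {"www"});
        System.out.println(STORED_CACHE.getStrings(url)[0]);
        try {
            DEFAULT.getStrings(url);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Callers that only ever touch untokenized fields keep using DEFAULT and see no 
change in behaviour, while code that needs whole values from tokenized fields 
opts in through STORED_CACHE. 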



> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
>                 Key: LUCENE-252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-252
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, 
> FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, 
> FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch, 
> FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off, 
> especially with more than one word in the field. I think it is much 
> more logical to sort by the field's string value if the sort field is 
> Tokenized and stored. This way you'll get the CORRECT sort order.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


