Rodrigo Joni Sestari created NUTCH-2381:
-------------------------------------------

             Summary: In some situations the class TextProfileSignature gives 
different signatures for the same text "profile" page.
                 Key: NUTCH-2381
                 URL: https://issues.apache.org/jira/browse/NUTCH-2381
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
    Affects Versions: 1.13
            Reporter: Rodrigo Joni Sestari


In some situations the class TextProfileSignature gives different signatures 
for the same text "profile" page.

The method TextProfileSignature.calculate uses a HashMap to store the tokens. 
After some processing, the tokens are sorted by decreasing frequency.

For some pages, such as "http://curia.europa.eu/jcms/", the text "profile" is 
the same but the signature is different for each fetch.

This happens because the tokens are sorted only by decreasing frequency. Tokens 
with the same frequency may not have the same order in different fetches.

HashMap makes no guarantees as to the order of the map, and it does not 
guarantee that the order will remain constant over time.

My suggestion is to change the method TokenComparator.compare so that it sorts 
by frequency and then by token name.
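
A rough sketch of the idea, not the actual Nutch code: the Token class, its 
field names (val, cnt) and the profile helper below are hypothetical stand-ins, 
and the sketch assumes Java 8. It only shows how a tie-break on the token text 
makes the sorted order independent of HashMap iteration order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenOrderSketch {

  // Hypothetical stand-in for the token class used by TextProfileSignature.
  static final class Token {
    final String val; // token text
    final int cnt;    // token frequency

    Token(String val, int cnt) {
      this.val = val;
      this.cnt = cnt;
    }

    @Override
    public String toString() {
      return val + " " + cnt;
    }
  }

  // Sort by decreasing frequency, then by token text. The second key gives
  // tokens with equal frequency a fixed relative order.
  static final Comparator<Token> BY_FREQ_THEN_NAME =
      Comparator.comparingInt((Token t) -> t.cnt).reversed()
          .thenComparing(t -> t.val);

  // Builds the sorted token list from a frequency map (assumed helper).
  static List<Token> profile(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_NAME);
    return tokens;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("court", 3);
    counts.put("justice", 3);
    counts.put("europa", 5);
    // Prints [europa 5, court 3, justice 3] regardless of map order.
    System.out.println(profile(counts));
  }
}
```

With only the frequency as sort key, "court" and "justice" could come out in 
either order, which would change the text that is hashed into the signature.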

Rodrigo



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
