[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2381:
-----------------------------------
    Fix Version/s: 1.15

> In some situations the class TextProfileSignature gives different signatures
> for the same text "profile" page.
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-2381
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2381
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.13
>            Reporter: Rodrigo Joni Sestari
>              Labels: signature
>             Fix For: 1.15
>
>
> In some situations the class TextProfileSignature gives different signatures
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to store the tokens;
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages, such as "http://curia.europa.eu/jcms/", the text "profile" is
> the same but the signature differs on each fetch.
> This happens because the tokens are sorted only by decreasing frequency.
> Tokens with the same frequency may not come out in the same order across
> fetches.
> HashMap makes no guarantees as to the order of the map, and in particular
> does not guarantee that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare so that it
> sorts by frequency and then by name.
> Rodrigo

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
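
The tie-break suggested in the issue can be sketched as follows. This is only an illustration under stated assumptions, not the actual Nutch code or the committed fix: the Token class, its fields val and cnt, the comparator constant, and the sortedProfile helper are all hypothetical stand-ins. The comparator orders by decreasing frequency and falls back to the token text when frequencies are equal, so the result no longer depends on HashMap iteration order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenOrderSketch {

  // Hypothetical stand-in for the token type described in the issue.
  static final class Token {
    final String val; // token text
    final int cnt;    // frequency of the token in the page
    Token(String val, int cnt) { this.val = val; this.cnt = cnt; }
  }

  // Decreasing frequency first; ties are broken by token text so that
  // equal-frequency tokens always end up in the same order.
  static final Comparator<Token> BY_FREQ_THEN_NAME = (a, b) -> {
    int byCount = Integer.compare(b.cnt, a.cnt);
    return byCount != 0 ? byCount : a.val.compareTo(b.val);
  };

  // Builds the sorted "profile" from a frequency map, whatever its
  // iteration order happens to be.
  static List<String> sortedProfile(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_NAME);
    List<String> out = new ArrayList<>();
    for (Token t : tokens) {
      out.add(t.val + "=" + t.cnt);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("court", 3);
    counts.put("justice", 3);
    counts.put("europa", 5);
    counts.put("curia", 3);
    // prints [europa=5, court=3, curia=3, justice=3]
    System.out.println(sortedProfile(counts));
  }
}
```

With a frequency-only comparator, the three tokens with count 3 could appear in any relative order depending on how the map iterates; with the secondary key their order is fixed, so a hash computed over the sorted profile stays stable.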