[
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939372#comment-16939372
]
ASF GitHub Bot commented on NUTCH-2381:
---------------------------------------
sebastian-nagel commented on pull request #473: NUTCH-2381 In some situations
the class TextProfileSignature gives different signatures for the same text
"profile" page
URL: https://github.com/apache/nutch/pull/473
- implement secondary sorting, similar to the patch provided by Rodrigo Joni
Sestari
- allow restoring the previous behavior by setting the property
`db.signature.text_profile.sec_sort_lex = false`
----------------------------------------------------------------
> In some situations the class TextProfileSignature gives different signatures
> for the same text "profile" page.
> --------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.13
> Reporter: Rodrigo Joni Sestari
> Assignee: Sebastian Nagel
> Priority: Major
> Labels: signature
> Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to store the tokens;
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages, such as "http://curia.europa.eu/jcms/", the text "profile" is
> the same but the signature differs from fetch to fetch.
> This happens because the tokens are sorted only by decreasing frequency.
> Tokens with the same frequency may not come out in the same order in
> different fetches.
> HashMap makes no guarantees as to the order of the map, and in particular
> does not guarantee that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare so that it
> sorts by frequency and then by name.
> Rodrigo
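
The tie-breaking idea described above can be illustrated with a minimal, standalone sketch. This is not the actual Nutch patch: the class `TokenOrderSketch`, the `Token` type with its `val` and `cnt` fields, and the method `sortedTokens` are hypothetical names chosen for illustration. The sketch only shows that ordering by decreasing frequency and then lexicographically by token text removes the dependence on map iteration order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenOrderSketch {

  /** Hypothetical stand-in for a (token text, frequency) pair. */
  static final class Token {
    final String val;
    final int cnt;

    Token(String val, int cnt) {
      this.val = val;
      this.cnt = cnt;
    }
  }

  /** Decreasing frequency first; ties are broken lexicographically by text. */
  static final Comparator<Token> BY_FREQ_THEN_NAME =
      Comparator.comparingInt((Token t) -> t.cnt).reversed()
          .thenComparing(t -> t.val);

  /** Returns the token texts in a deterministic order for the given counts. */
  static List<String> sortedTokens(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_NAME);
    List<String> out = new ArrayList<>();
    for (Token t : tokens) {
      out.add(t.val);
    }
    return out;
  }

  public static void main(String[] args) {
    // Same counts, different insertion (and therefore iteration) order.
    Map<String, Integer> first = new LinkedHashMap<>();
    first.put("court", 2);
    first.put("justice", 2);
    first.put("europa", 3);

    Map<String, Integer> second = new LinkedHashMap<>();
    second.put("justice", 2);
    second.put("europa", 3);
    second.put("court", 2);

    System.out.println(sortedTokens(first));
    System.out.println(sortedTokens(second));
  }
}
```

Both calls in `main` print `[europa, court, justice]`. With a comparator on frequency alone, a stable sort would keep "court" and "justice" in whatever order the map happened to yield them, which is the source of the differing signatures reported above.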
--
This message was sent by Atlassian Jira
(v8.3.4#803005)