[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531290
]
Hudson commented on NUTCH-25:
-
Integrated in Nutch-Nightly #222 (See
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530796
]
Hudson commented on NUTCH-25:
-
Integrated in Nutch-Nightly #219 (See
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517170
]
Doğacan Güney commented on NUTCH-25:
At a very quick look, one potential drawback of the private EncodingClue +
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066
]
Doug Cook commented on NUTCH-25:
Cool -- will take a look at the new patch (and will try to make stripGarbage
more
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342
]
Doug Cook commented on NUTCH-25:
Doğacan,
Thanks for the quick feedback.
* EncodingDetector api is way too open.
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515365
]
Doğacan Güney commented on NUTCH-25:
[snip snip]
Internal to guessEncoding, we could certainly add the clue
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461
]
Doug Cook commented on NUTCH-25:
Can you provide a link on icu4j's language detection?
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026
]
Doug Cook commented on NUTCH-25:
OK, I've got more data, and a proposed solution.
I created a test set with a number
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
Not sure where this belongs architecturally and aesthetically -- will think
about
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514433
]
Doğacan Güney commented on NUTCH-25:
Doug, thanks for the (very) detailed feedback! This is incredibly helpful.
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438
]
Doug Cook commented on NUTCH-25:
As far as the problem cases, I'm running a test now on my test DB (the ~60K doc
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
Hi, Doğacan.
My sincere apologies for the slow response, especially given the
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377
]
Doug Cook commented on NUTCH-25:
I should also add that a significant number of the URLs seem to have been fixed
by
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382
]
Doug Cook commented on NUTCH-25:
Oops, spoke to soon. On running a more extensive test, I saw quite a few
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507593
]
Doğacan Güney commented on NUTCH-25:
Doug, have you been able to look at my patch?
needs 'character encoding'
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041
]
Doug Cook commented on NUTCH-25:
Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye
shall
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
]
Doug Cook commented on NUTCH-25:
We might want to think about raising the priority of this. I've seen encoding
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
]
Ken Krugler commented on NUTCH-25:
--
I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this.
[
http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ]
Chris Fellows commented on NUTCH-25:
This was last updated May '05. Has this charset and language detection been
integrated into Nutch yet?
If not, at what point should
19 matches
Mail list logo