[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-09-29 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531290 ] Hudson commented on NUTCH-25: Integrated in Nutch-Nightly #222 (See

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-09-27 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530796 ] Hudson commented on NUTCH-25: Integrated in Nutch-Nightly #219 (See

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517170 ] Doğacan Güney commented on NUTCH-25: At a very quick look, one potential drawback of the private EncodingClue +

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-01 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066 ] Doug Cook commented on NUTCH-25: Cool -- will take a look at the new patch (and will try to make stripGarbage more

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342 ] Doug Cook commented on NUTCH-25: Doğacan, Thanks for the quick feedback. * EncodingDetector api is way too open.

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515365 ] Doğacan Güney commented on NUTCH-25: [snip snip] Internal to guessEncoding, we could certainly add the clue

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461 ] Doug Cook commented on NUTCH-25: Can you provide a link on icu4j's language detection?

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026 ] Doug Cook commented on NUTCH-25: OK, I've got more data, and a proposed solution. I created a test set with a number

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ] Doug Cook commented on NUTCH-25: Not sure where this belongs architecturally and aesthetically -- will think about

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514433 ] Doğacan Güney commented on NUTCH-25: Doug, thanks for the (very) detailed feedback! This is incredibly helpful.

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438 ] Doug Cook commented on NUTCH-25: As far as the problem cases, I'm running a test now on my test DB (the ~60K doc

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375 ] Doug Cook commented on NUTCH-25: Hi, Doğacan. My sincere apologies for the slow response, especially given the

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377 ] Doug Cook commented on NUTCH-25: I should also add that a significant number of the URLs seem to have been fixed by

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ] Doug Cook commented on NUTCH-25: Oops, spoke too soon. On running a more extensive test, I saw quite a few

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-06-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507593 ] Doğacan Güney commented on NUTCH-25: Doug, have you been able to look at my patch?

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-22 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041 ] Doug Cook commented on NUTCH-25: Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye shall

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ] Doug Cook commented on NUTCH-25: We might want to think about raising the priority of this. I've seen encoding

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525 ] Ken Krugler commented on NUTCH-25: I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this.

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2006-04-26 Thread Chris Fellows (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ] Chris Fellows commented on NUTCH-25: This was last updated May '05. Has this charset and language detection been integrated into Nutch yet? If not, at what point should