[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515230 ]
Doğacan Güney commented on NUTCH-25:
------------------------------------

Overall I think the idea behind EncodingDetector is very solid. I will take a better look at your patch, but here are a couple of comments after a quick review:

* The EncodingDetector API is way too open. IMO, EncodingClue should be a private static class (users can pass a clue like detector.addClue(value, source, confidence)), and EncodingDetector should never expose clues. For example, autoDetectClues should return void (or perhaps a boolean indicating whether autodetection succeeded) and store the clues internally.

* In this code:

    public boolean meetsThreshold() {
      Integer mt = (Integer) thresholds.get(value);
      // use global value if no encoding-specific value found
      int myThreshold = (mt != null) ? mt.intValue() : minConfidence;
      return (confidence < 0 || (minConfidence >= 0 && confidence >= myThreshold));
    }

  why does meetsThreshold return true if confidence < 0?

* If users specify an encoding clue with no confidence, then we should give it a default positive confidence instead of -1. Of course, that confidence value needs to be very small, maybe just +1.

* It would be nice to "stack" clues. Assume that autodetection returned 2 possible encodings: ISO-8859-1 with 50 confidence and UTF-8 with 45 confidence. If I add a new clue (say, coming from the HTTP header) for UTF-8 with +6 confidence, the overall confidence for UTF-8 should now be 51.

* This is mostly my personal nit, but Java 5 style generics would be nice.

About contributing stuff back: The article at http://wiki.apache.org/nutch/HowToContribute is a good starting point, but it assumes that you will be working on trunk. I am not sure how you can 'forward-port' your changes from an older version besides doing it manually. One approach may be to first backport a part of the trunk to your local installation, change the code, then do a "diff -pu" (against the backported version). Since trunk contains newer features and bug fixes you will also be getting them for free this way :).
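
The API points above (a private clue class, addClue, a small positive default confidence, and stacking) could look roughly like the following. This is only an illustrative sketch, not code from the patch: the class name EncodingDetectorSketch, the method guessEncoding and its fallback parameter are assumptions made up for the example; only the addClue(value, source, confidence) call shape comes from the comment itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a detector that keeps its clues internal. */
final class EncodingDetectorSketch {
    /** Assumed default for clues given without a confidence: small but positive. */
    private static final int DEFAULT_CONFIDENCE = 1;

    /** Private, so callers can never hold or modify a clue directly. */
    private static final class EncodingClue {
        final String value;      // encoding name, e.g. "UTF-8"
        final String source;     // where the clue came from, e.g. "header"
        final int confidence;

        EncodingClue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final List<EncodingClue> clues = new ArrayList<>();

    public void addClue(String value, String source, int confidence) {
        clues.add(new EncodingClue(value, source, confidence));
    }

    /** Clue without an explicit confidence gets the small positive default. */
    public void addClue(String value, String source) {
        addClue(value, source, DEFAULT_CONFIDENCE);
    }

    /**
     * Sums ("stacks") the confidences of all clues per encoding and returns
     * the encoding with the highest positive total, or the fallback if no
     * encoding has a positive total. Ties go to the encoding seen first.
     */
    public String guessEncoding(String fallback) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (EncodingClue clue : clues) {
            totals.merge(clue.value, clue.confidence, Integer::sum);
        }
        String best = fallback;
        int bestScore = 0;
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        EncodingDetectorSketch detector = new EncodingDetectorSketch();
        detector.addClue("ISO-8859-1", "detect", 50);
        detector.addClue("UTF-8", "detect", 45);
        detector.addClue("UTF-8", "header", 6);   // UTF-8 total is now 51
        System.out.println(detector.guessEncoding("windows-1252")); // prints UTF-8
    }
}
```

With the numbers from the stacking example, UTF-8 ends up at 51 against 50 for ISO-8859-1, so the header clue tips the result. Encoding names are compared as plain strings here; a real implementation would probably want to normalize aliases and case first.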
> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, NUTCH-25.patch, NUTCH-25_draft.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers