[jira] [Commented] (NUTCH-2236) Upgrade to Hadoop 2.7.1
[ https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171725#comment-15171725 ] Tien Nguyen Manh commented on NUTCH-2236: - No problem, just to make it run on Hadoop 2.7.1 > Upgrade to Hadoop 2.7.1 > --- > > Key: NUTCH-2236 > URL: https://issues.apache.org/jira/browse/NUTCH-2236 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2236.patch > > > Upgrade to Hadoop 2.7.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171264#comment-15171264 ]

Tien Nguyen Manh commented on NUTCH-2234:
-----------------------------------------

elasticsearch 2.1.1 uses httpclient 4.3.6

> Upgrade to elasticsearch 2.1.1
> ------------------------------
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.11
> Reporter: Tien Nguyen Manh
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.
[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1
[ https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-2236:
------------------------------------
Attachment: NUTCH-2236.patch

I run Nutch 1.11 on Hadoop 2.7.1 with this patch. We also need to add this line to etc/hadoop/mapred-env.sh:

export HADOOP_USER_CLASSPATH_FIRST=true

> Upgrade to Hadoop 2.7.1
> -----------------------
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.11
> Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1
[jira] [Created] (NUTCH-2236) Upgrade to Hadoop 2.7.1
Tien Nguyen Manh created NUTCH-2236: --- Summary: Upgrade to Hadoop 2.7.1 Key: NUTCH-2236 URL: https://issues.apache.org/jira/browse/NUTCH-2236 Project: Nutch Issue Type: Improvement Affects Versions: 1.11 Reporter: Tien Nguyen Manh Upgrade to Hadoop 2.7.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: NUTCH-2234.patch > Upgrade to elasticsearch 2.1.1 > -- > > Key: NUTCH-2234 > URL: https://issues.apache.org/jira/browse/NUTCH-2234 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh > Attachments: NUTCH-2234.patch > > > Currently we use elasticsearch 1.x, We should upgrade to 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-1687:
------------------------------------
Attachment: NUTCH-1687-2.patch

Here it is: I updated my initial patch for version 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating a new iterator every time a new host is added, which happens quite frequently.

> Pick queue in Round Robin
> -------------------------
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Reporter: Tien Nguyen Manh
> Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by scanning from the start of the queues list, so queues at the start of the list have more chance of being picked first. That can cause a long-tail problem, in which only a few queues remain available at the end, each holding many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); ==> always reset to find queue from start
>   while (it.hasNext()) {
>
> I think it is better to pick queues in round-robin order. That can reduce the time needed to find an available queue and ensures every queue is picked in turn; and if we use TopN in the generator, there is no long tail of queues at the end.
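The round-robin idea described in this issue can be illustrated with a minimal sketch. This is not the code from NUTCH-1687-2.patch: the class `RoundRobinQueues`, its fields, and the use of a plain index as the cursor (instead of the circular linked list mentioned in the comment) are all assumptions made for illustration. Each queue is reduced to a counter of remaining URLs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the fetcher's queue set; names are illustrative only. */
class RoundRobinQueues {
  private final List<String> ids = new ArrayList<>();        // queue ids (e.g. host names)
  private final List<Integer> remaining = new ArrayList<>(); // URLs left in each queue
  private int cursor = 0;                                    // where the next scan starts

  /** Registers a queue; the cursor is untouched, so no scan state is lost. */
  synchronized void add(String id, int urls) {
    ids.add(id);
    remaining.add(urls);
  }

  /** Returns the id of the next non-empty queue after the one served last, or null. */
  synchronized String next() {
    for (int scanned = 0; scanned < ids.size(); scanned++) {
      int i = (cursor + scanned) % ids.size();
      int left = remaining.get(i);
      if (left > 0) {
        remaining.set(i, left - 1);
        cursor = (i + 1) % ids.size(); // resume after this queue next time
        return ids.get(i);
      }
    }
    return null; // every queue is empty
  }

  public static void main(String[] args) {
    RoundRobinQueues q = new RoundRobinQueues();
    q.add("a", 2);
    q.add("b", 1);
    q.add("c", 2);
    for (String id = q.next(); id != null; id = q.next()) {
      System.out.println(id);
    }
  }
}
```

With a scan that always restarts from the first queue, the same three queues would be served as a, a, b, c, c; with the persistent cursor they are interleaved, which is the behaviour the issue asks for.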
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: (was: NUTCH-1687-2.patch) > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Comment: was deleted (was: I update my initial patch for ver 1.11. I crawl large number of hosts, so using circular linked list prevents creating new iterator every time a new hosts is added.) > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: (was: NUTCH-2234.patch) > Upgrade to elasticsearch 2.1.1 > -- > > Key: NUTCH-2234 > URL: https://issues.apache.org/jira/browse/NUTCH-2234 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh > > Currently we use elasticsearch 1.x, We should upgrade to 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: NUTCH-2234.patch > Upgrade to elasticsearch 2.1.1 > -- > > Key: NUTCH-2234 > URL: https://issues.apache.org/jira/browse/NUTCH-2234 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh > Attachments: NUTCH-2234.patch > > > Currently we use elasticsearch 1.x, We should upgrade to 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
Tien Nguyen Manh created NUTCH-2234: --- Summary: Upgrade to elasticsearch 2.1.1 Key: NUTCH-2234 URL: https://issues.apache.org/jira/browse/NUTCH-2234 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.11 Reporter: Tien Nguyen Manh Currently we use elasticsearch 1.x, We should upgrade to 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687-2.patch I update my initial patch for ver 1.11. I crawl large number of hosts, so using circular linked list prevents creating new iterator every time a new hosts is added. > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, > NUTCH-1687.tejasp.v1.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2225: Attachment: NUTCH-2225.patch > Parsed time not include time to parse > - > > Key: NUTCH-2225 > URL: https://issues.apache.org/jira/browse/NUTCH-2225 > Project: Nutch > Issue Type: Bug >Reporter: Tien Nguyen Manh >Priority: Trivial > Attachments: NUTCH-2225.patch > > > In ParseSegment we report parse time > LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url); > But the start time is the time after we parse so in log we see many "0 ms" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2225) Parsed time not include time to parse
Tien Nguyen Manh created NUTCH-2225:
---------------------------------------

Summary: Parsed time not include time to parse
Key: NUTCH-2225
URL: https://issues.apache.org/jira/browse/NUTCH-2225
Project: Nutch
Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Trivial

In ParseSegment we report the parse time:

LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);

But the start time is taken after we parse, so in the log we see many "0 ms" entries.
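The timing problem reported here can be shown with a small sketch. It is not the ParseSegment code or the attached patch: `ParseTimer` and `timeMillis` are hypothetical names, and the parse step is represented by an arbitrary `Runnable`. The only point is that the start timestamp has to be taken before the work, not after it.

```java
/** Hypothetical helper; illustrates where the start timestamp must be taken. */
class ParseTimer {
  /** Runs the task and returns the elapsed wall-clock time in milliseconds. */
  static long timeMillis(Runnable task) {
    long start = System.currentTimeMillis(); // taken BEFORE the work starts
    task.run();                              // stands in for the parse call
    long end = System.currentTimeMillis();
    return end - start;
  }

  public static void main(String[] args) {
    long elapsed = timeMillis(() -> {
      try {
        Thread.sleep(50); // simulated parse work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    System.out.println("Parsed (" + elapsed + "ms)");
  }
}
```

If `start` were read after `task.run()`, the difference `end - start` would be close to zero regardless of how long the task took, which matches the "0 ms" log lines described in the issue.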
[jira] [Updated] (NUTCH-2224) Wrong metric compute in Fetcher status report
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2224: Attachment: NUTCH-2224.patch > Wrong metric compute in Fetcher status report > - > > Key: NUTCH-2224 > URL: https://issues.apache.org/jira/browse/NUTCH-2224 > Project: Nutch > Issue Type: Bug >Reporter: Tien Nguyen Manh >Priority: Trivial > Attachments: NUTCH-2224.patch > > > Currently we convert from bytes to kbits by > (bytes.get() / 125l) > I thinks it should be /128l -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2224) Wrong metric compute in Fetcher status report
Tien Nguyen Manh created NUTCH-2224:
---------------------------------------

Summary: Wrong metric compute in Fetcher status report
Key: NUTCH-2224
URL: https://issues.apache.org/jira/browse/NUTCH-2224
Project: Nutch
Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Trivial

Currently we convert from bytes to kbits with

(bytes.get() / 125l)

I think it should be / 128l
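The two divisors correspond to two different definitions of the unit, which a short worked example makes concrete. The class and method names below are hypothetical and are not taken from the Fetcher code; which definition the status report is meant to use is not stated in the issue.

```java
/** Hypothetical helper comparing the two divisors discussed in the issue. */
class Throughput {
  /** Decimal kilobits: 1 kbit = 1000 bits, so bytes * 8 / 1000 = bytes / 125. */
  static long toKilobits(long bytes) {
    return bytes / 125L;
  }

  /** Binary kibibits: 1 Kibit = 1024 bits, so bytes * 8 / 1024 = bytes / 128. */
  static long toKibibits(long bytes) {
    return bytes / 128L;
  }

  public static void main(String[] args) {
    long bytes = 128_000L;
    System.out.println(toKilobits(bytes) + " kbit (divisor 125)");
    System.out.println(toKibibits(bytes) + " Kibit (divisor 128)");
  }
}
```

For 128,000 bytes the divisor 125 gives 1024 and the divisor 128 gives 1000, so the two readings differ by 2.4 percent. Dividing by 125 is exact for SI kilobits, and dividing by 128 is exact for 1024-bit units.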
[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-2223:
------------------------------------
Attachment: NUTCH-2223.patch

Patch for nutch 1.11

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> ----------------------------------------------------------------------------
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Reporter: Tien Nguyen Manh
> Priority: Minor
> Attachments: NUTCH-2223.patch
>
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)
[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Fix Version/s: 1.13 > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection > > > Key: NUTCH-2223 > URL: https://issues.apache.org/jira/browse/NUTCH-2223 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh >Priority: Minor > > Stracktrace for the hang seems to be: > at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source) > at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown > Source) > at > org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown > Source) > at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) > at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) > at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) > at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) > at > org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54) > at > org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41) > at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192) > at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439) > at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61) > at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
Tien Nguyen Manh created NUTCH-2223:
---------------------------------------

Summary: Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
Key: NUTCH-2223
URL: https://issues.apache.org/jira/browse/NUTCH-2223
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.11
Reporter: Tien Nguyen Manh
Priority: Minor

Stack trace for the hang seems to be:
at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)
[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Fix Version/s: (was: 1.13) > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection > > > Key: NUTCH-2223 > URL: https://issues.apache.org/jira/browse/NUTCH-2223 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh >Priority: Minor > > Stracktrace for the hang seems to be: > at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source) > at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown > Source) > at > org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown > Source) > at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) > at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) > at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) > at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) > at > org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54) > at > org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41) > at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192) > at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439) > at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61) > at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117020#comment-15117020 ] Tien Nguyen Manh commented on NUTCH-961: Can NUTCH-1233: use tika to extract outlink solve that problem? > Expose Tika's boilerpipe support > > > Key: NUTCH-961 > URL: https://issues.apache.org/jira/browse/NUTCH-961 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Attachments: BoilerpipeExtractorRepository.java, > NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, > NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, > NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, > NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, > NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch > > > Tika 0.8 comes with the Boilerpipe content handler which can be used to > extract boilerplate content from HTML pages. We should see how we can expose > Boilerplate in the Nutch cofiguration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772 ]

Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM:
-----------------------------------------------------------------

Ah yes. Could you explain why we need to parse it twice? With NUTCH-1233, can we use just one parse?

was (Author: tiennm):
Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> --------------------------------
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772 ] Tien Nguyen Manh commented on NUTCH-961: AH yes, Could you explain why we need to parse it twice? > Expose Tika's boilerpipe support > > > Key: NUTCH-961 > URL: https://issues.apache.org/jira/browse/NUTCH-961 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Attachments: BoilerpipeExtractorRepository.java, > NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, > NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, > NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, > NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, > NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch > > > Tika 0.8 comes with the Boilerpipe content handler which can be used to > extract boilerplate content from HTML pages. We should see how we can expose > Boilerplate in the Nutch cofiguration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114658#comment-15114658 ]

Tien Nguyen Manh commented on NUTCH-961:
----------------------------------------

One note on boilerpipe support: it is significantly slower than parse-html.
I parsed the same segment with each parser, and here are the results: parse-html: 3hm, parse-tika with boilerpipe: 5h10m, and parse-tika without boilerpipe: 4h.

> Expose Tika's boilerpipe support
> --------------------------------
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110217#comment-15110217 ]

Tien Nguyen Manh commented on NUTCH-961:
----------------------------------------

I'm using the patch NUTCH-961-1.11-1.patch. It works fine when run from Eclipse and when run in Hadoop.
It has a problem when I run in local mode: it throws the exception "Can't retrieve Tika parser for mime-type text/html".
It is not a problem with parse-plugins.xml. It seems to be a problem with the TikaConfig constructor TikaConfig(ClassLoader loader), which fails to load some configuration via the class loader when run in local mode.

> Expose Tika's boilerpipe support
> --------------------------------
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.
[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-1679:
------------------------------------
Attachment: NUTCH-1679-2.patch

I have another solution. For a new link in DbUpdaterReducer we only add the url: no status, no fetch time, nor any other info.
- So if this link already exists in the database, we don't override anything.
- Otherwise, it is actually a new link; it will have status = 0 (the default value) and we will initialize it (set status, fetch time, ...) in Generator instead.
I tested it with the hbase backend on nutch-2.3.

> UpdateDb using batchId, link may override crawled page.
> -------------------------------------------------------
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.2.1
> Reporter: Tien Nguyen Manh
> Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>
> The problem is in the Hbase store; I am not sure about other stores.
> Suppose that in the first crawl cycle we crawl link A and get an outlink B.
> In the second cycle we crawl link B, which also has a link pointing to A.
> In the second updatedb we load only page B from the store, and will add A as a new link because it doesn't know A already exists in the store, and so will override A.
> UpdateDb must be run without batchId or we must set additionsAllowed=false
> Here is the code for a new page:
> page = new WebPage();
> schedule.initializeSchedule(url, page);
> page.setStatus(CrawlStatus.STATUS_UNFETCHED);
> try {
>   scoringFilters.initialScore(url, page);
> } catch (ScoringFilterException e) {
>   page.setScore(0.0f);
> }
> The new page will override the old page's status, score, fetchTime, fetchInterval, retries, metadata[CASH_KEY]
> - I think we can change something here so that a new page will only update one column, for example 'link', and if it is really a new page, we can initialize all the above fields in the generator
> - or we add a checkAndPut operator to the store, so that when adding a new page we first check whether it already exists
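The second proposal in the description, a check-before-write when adding a newly discovered link, can be sketched with an in-memory map. This is only an illustration of the semantics: `LinkStore` and its methods are hypothetical, the statuses are plain strings, and nothing here uses the Gora or HBase APIs that Nutch 2.x actually writes through.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the web page store. */
class LinkStore {
  private final Map<String, String> statusByUrl = new HashMap<>();

  /** Records that a page was crawled. */
  void markFetched(String url) {
    statusByUrl.put(url, "FETCHED");
  }

  /**
   * Adds a newly discovered link only if the url is not stored yet.
   * Returns true if a new row was created, false if an existing row was kept.
   */
  boolean addNewLink(String url) {
    return statusByUrl.putIfAbsent(url, "UNFETCHED") == null;
  }

  String status(String url) {
    return statusByUrl.get(url);
  }

  public static void main(String[] args) {
    LinkStore store = new LinkStore();
    store.markFetched("A");                   // cycle 1: A is crawled
    System.out.println(store.addNewLink("A")); // cycle 2: B links back to A
    System.out.println(store.status("A"));
  }
}
```

In the scenario from the description, page A keeps its crawled state when page B links back to it, because the insert is skipped for a url that already exists.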
[jira] [Created] (NUTCH-1705) Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page
Tien Nguyen Manh created NUTCH-1705:
---------------------------------------

Summary: Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page
Key: NUTCH-1705
URL: https://issues.apache.org/jira/browse/NUTCH-1705
Project: Nutch
Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor

Currently HtmlParser and TikaParser always skip extracting text and title for noIndex pages, that is, pages which have the noIndex robots metatag.
But some parse filters may still be interested in the text and title, such as NUTCH-1661, where we may decide whether to follow a page by its language.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Updated] (NUTCH-1705) Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page
[ https://issues.apache.org/jira/browse/NUTCH-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1705: Attachment: NUTCH-1705.patch > Make configuration option for HtmlParser & TikaParser to extract text or > title for noIndex page > --- > > Key: NUTCH-1705 > URL: https://issues.apache.org/jira/browse/NUTCH-1705 > Project: Nutch > Issue Type: Improvement >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1705.patch > > > Currently HtmlParser and TikaParser always skip extracting text and title for > noIndex page - page which have noIndex robots metatags. > But some parse-filter may still interested in text and title such as > NUTCH-1661, where we may decide wether to follow a page by it's language. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-1478:
------------------------------------
Attachment: NUTCH-1478-parse-v2.patch

I ported parse-metatags to 2.x; this patch supports multiple values in metatags.

> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
> Key: NUTCH-1478
> URL: https://issues.apache.org/jira/browse/NUTCH-1478
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.1
> Reporter: kiran
> Fix For: 2.3
>
> Attachments: NUTCH-1478-parse-v2.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series.
> This will take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), but one change is that there is no need to give the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>
> This is only the first version and does not include the junit test. I will update the new version soon.
> This will parse the tags and index the tags in Solr. Make sure you create the fields in 'index.parse.md' in nutch-site.xml in schema.xml in Solr.
> Please let me know if you have any suggestions
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.
[jira] [Created] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x
Tien Nguyen Manh created NUTCH-1704: --- Summary: Port DomainBlacklist urlfilter to 2.x Key: NUTCH-1704 URL: https://issues.apache.org/jira/browse/NUTCH-1704 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Attachments: NUTCH-1704.patch Port NUTCH-1210 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1704: Attachment: NUTCH-1704.patch > Port DomainBlacklist urlfilter to 2.x > - > > Key: NUTCH-1704 > URL: https://issues.apache.org/jira/browse/NUTCH-1704 > Project: Nutch > Issue Type: Improvement >Reporter: Tien Nguyen Manh > Attachments: NUTCH-1704.patch > > > Port NUTCH-1210 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: (was: NUTCH-1702.patch) > Port HostNormalizer to 2.x > -- > > Key: NUTCH-1702 > URL: https://issues.apache.org/jira/browse/NUTCH-1702 > Project: Nutch > Issue Type: Improvement >Reporter: Tien Nguyen Manh > Fix For: 2.3 > > Attachments: NUTCH-1702.patch > > > Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: NUTCH-1702.patch > Port HostNormalizer to 2.x > -- > > Key: NUTCH-1702 > URL: https://issues.apache.org/jira/browse/NUTCH-1702 > Project: Nutch > Issue Type: Improvement >Reporter: Tien Nguyen Manh > Fix For: 2.3 > > Attachments: NUTCH-1702.patch > > > Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Fix Version/s: 2.3 > Port HostNormalizer to 2.x > -- > > Key: NUTCH-1702 > URL: https://issues.apache.org/jira/browse/NUTCH-1702 > Project: Nutch > Issue Type: Improvement >Reporter: Tien Nguyen Manh > Fix For: 2.3 > > Attachments: NUTCH-1702.patch > > > Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: NUTCH-1702.patch > Port HostNormalizer to 2.x > -- > > Key: NUTCH-1702 > URL: https://issues.apache.org/jira/browse/NUTCH-1702 > Project: Nutch > Issue Type: Improvement >Reporter: Tien Nguyen Manh > Fix For: 2.3 > > Attachments: NUTCH-1702.patch > > > Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1702) Port HostNormalizer to 2.x
Tien Nguyen Manh created NUTCH-1702: --- Summary: Port HostNormalizer to 2.x Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option
[ https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1701: Attachment: NUTCH-1701-2x.patch > Make Solr Document Boost as an option > - > > Key: NUTCH-1701 > URL: https://issues.apache.org/jira/browse/NUTCH-1701 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1701-2x.patch > > > Nutch's SolrIndexer uses the Nutch score as the document boost by default. We should > make this an option, because the Nutch score can be used for boosting in other ways, > such as boosting at query time via a function query. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option
[ https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1701: Fix Version/s: 1.8 2.3 > Make Solr Document Boost as an option > - > > Key: NUTCH-1701 > URL: https://issues.apache.org/jira/browse/NUTCH-1701 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1701-2x.patch > > > Nutch's SolrIndexer uses the Nutch score as the document boost by default. We should > make this an option, because the Nutch score can be used for boosting in other ways, > such as boosting at query time via a function query. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1701) Make Solr Document Boost as an option
Tien Nguyen Manh created NUTCH-1701: --- Summary: Make Solr Document Boost as an option Key: NUTCH-1701 URL: https://issues.apache.org/jira/browse/NUTCH-1701 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Tien Nguyen Manh Priority: Minor Nutch's SolrIndexer uses the Nutch score as the document boost by default. We should make this an option, because the Nutch score can be used for boosting in other ways, such as boosting at query time via a function query. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
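The option proposed in NUTCH-1701 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the attached patch: `IndexDoc`, `BoostOption`, the `useScoreAsBoost` flag and the `boost` field name are all hypothetical and do not refer to Nutch or SolrJ classes. The idea shown is only that the score is always stored as a field (so it stays available for query-time boosting) while the index-time document boost is applied only when the option is enabled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical document holder; not SolrInputDocument or NutchDocument. */
class IndexDoc {
    final Map<String, Object> fields = new HashMap<>();
    float boost = 1.0f; // neutral index-time boost
}

/** Hypothetical helper showing the boost as an opt-in setting. */
class BoostOption {
    /**
     * Always stores the crawl score in a field, and applies it as the
     * index-time document boost only when useScoreAsBoost is true.
     */
    static IndexDoc build(String url, float score, boolean useScoreAsBoost) {
        IndexDoc doc = new IndexDoc();
        doc.fields.put("url", url);
        doc.fields.put("boost", score); // assumed field name, kept for query-time use
        if (useScoreAsBoost) {
            doc.boost = score;
        }
        return doc;
    }
}
```

With the flag off, the document keeps a neutral boost of 1.0 and the score remains available in the stored field for a query-time function query.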
[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195 ] Tien Nguyen Manh commented on NUTCH-1693: - This patch only works together with a minor change I made in NUTCH-1686, which computes the signature after setting the text on "page". > TextMD5Signatue compute on textual content > -- > > Key: NUTCH-1693 > URL: https://issues.apache.org/jira/browse/NUTCH-1693 > Project: Nutch > Issue Type: New Feature >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1693.patch > > > I created a new MD5 signature that is based on textual content. In our case we use > boilerpipe to extract the main text from the content, so this signature is more > effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Fix Version/s: 2.3 > TextMD5Signatue compute on textual content > -- > > Key: NUTCH-1693 > URL: https://issues.apache.org/jira/browse/NUTCH-1693 > Project: Nutch > Issue Type: New Feature >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1693.patch > > > I created a new MD5 signature that is based on textual content. In our case we use > boilerpipe to extract the main text from the content, so this signature is more > effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Attachment: NUTCH-1693.patch > TextMD5Signatue compute on textual content > -- > > Key: NUTCH-1693 > URL: https://issues.apache.org/jira/browse/NUTCH-1693 > Project: Nutch > Issue Type: Bug >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1693.patch > > > I created a new MD5 signature that is based on textual content. In our case we use > boilerpipe to extract the main text from the content, so this signature is more > effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Issue Type: New Feature (was: Bug) > TextMD5Signatue compute on textual content > -- > > Key: NUTCH-1693 > URL: https://issues.apache.org/jira/browse/NUTCH-1693 > Project: Nutch > Issue Type: New Feature >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1693.patch > > > I created a new MD5 signature that is based on textual content. In our case we use > boilerpipe to extract the main text from the content, so this signature is more > effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1693) TextMD5Signatue compute on textual content
Tien Nguyen Manh created NUTCH-1693: --- Summary: TextMD5Signatue compute on textual content Key: NUTCH-1693 URL: https://issues.apache.org/jira/browse/NUTCH-1693 Project: Nutch Issue Type: Bug Reporter: Tien Nguyen Manh Priority: Minor I created a new MD5 signature that is based on textual content. In our case we use boilerpipe to extract the main text from the content, so this signature is more effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
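A text-based MD5 signature of the kind described in NUTCH-1693 could look roughly like the sketch below. It is an illustration only and not the attached patch: the class `TextMd5` is hypothetical and does not extend Nutch's signature classes, and the whitespace normalization step is an added assumption, not something the issue states. The input is assumed to be the already extracted main text (for example, boilerpipe output).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; it does not extend Nutch's Signature class. */
class TextMd5 {
    /** Returns the MD5 hex digest of the whitespace-normalized text. */
    static String signature(String extractedText) {
        // Assumption: collapse whitespace so layout-only differences do not change the hash.
        String normalized = extractedText == null
                ? ""
                : extractedText.trim().replaceAll("\\s+", " ");
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalized.getBytes(StandardCharsets.UTF_8));
            // Left-pad to 32 hex characters so leading zero bytes are preserved.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Two pages whose markup differs but whose extracted text is the same would get the same signature under this scheme, which is the deduplication benefit the issue describes.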
[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store
[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142 ] Tien Nguyen Manh commented on NUTCH-1686: - In this patch I also fixed a bug with fetchTime. Currently, each time we run updatedb over the whole table, fetchTime is increased again for all URLs. > Optimize UpdateDb to load less field from Store > --- > > Key: NUTCH-1686 > URL: https://issues.apache.org/jira/browse/NUTCH-1686 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3 >Reporter: Tien Nguyen Manh > Fix For: 2.3 > > Attachments: NUTCH-1686.patch > > > While running a large crawl I found that updatedb runs very slowly, especially the > Map task, which loads data from the store. > We can't filter by batchId to load fewer URLs due to the bug in NUTCH-1679, so > we must always update the whole table. > After checking the fields loaded in UpdateDbJob I found that it loads many > fields from the store (at least 15 of 25 fields), which makes updatedb slow. > I think that UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, > METADATA, which are used to compute the link score and distance, which I think is the main > purpose of this job. > The other fields are used to compute the URL schedule for the parser and fetcher; we > can move that code to the Parser or Fetcher without loading many new fields, because > many fields are generated by the parser. We can also use a Gora filter for the Fetcher > or Parser, so loading new fields is not a problem. > I also added a new field SCOREMETA to WebPage to store CASH and DISTANCE, which are > currently stored in METADATA. The CASH field is used in OPICScoring, which is used > only in UpdateDb, and DISTANCE is used only in the Generator and Updater, so moving > both fields to a new metadata field avoids reading METADATA in the Generator > and Updater; METADATA contains a lot of data that is used only by the Parser and > Indexer. > So with the new change: > UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, MARKERS; we > don't need to load the big Fetch family and INLINKS. > Generator only loads SCOREMETA (which is smaller than the current METADATA). -- This message was sent by Atlassian JIRA (v6.1.5#6160)
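The reduced field sets proposed in NUTCH-1686 can be written down as a small sketch. Everything here is hypothetical: `PageField` lists only a handful of illustrative names taken from the issue text rather than the real WebPage schema, and `FieldSelection` is not a Nutch or Gora class. It only shows the idea of each job declaring the narrow set of fields it needs instead of loading most of the row.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical subset of page fields; names follow the issue text, not the actual schema. */
enum PageField { SCORE, SCOREMETA, OUTLINKS, MARKERS, INLINKS, METADATA, CONTENT, TEXT, FETCH_TIME }

/** Hypothetical per-job field selection. */
class FieldSelection {
    /** Fields the issue proposes UpdateDb should load. */
    static Set<PageField> updateDbFields() {
        return EnumSet.of(PageField.SCORE, PageField.SCOREMETA,
                PageField.OUTLINKS, PageField.MARKERS);
    }

    /** Scoring metadata the issue proposes the Generator should read instead of METADATA. */
    static Set<PageField> generatorScoreFields() {
        return EnumSet.of(PageField.SCOREMETA);
    }

    /** Converts a selection to field names, in enum declaration order. */
    static String[] toNames(Set<PageField> fields) {
        String[] names = new String[fields.size()];
        int i = 0;
        for (PageField f : fields) {
            names[i++] = f.name();
        }
        return names;
    }
}
```

A job would pass the resulting names to whatever query API the store exposes; large fields such as INLINKS and METADATA are simply absent from the UpdateDb selection.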
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859364#comment-13859364 ] Tien Nguyen Manh commented on NUTCH-1687: - It is nice! > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we choose the queue to pick a URL from by starting at the head of the queues list, so queues at > the start of the list have more chance of being picked first. That can cause the problem > of long-tail queues, where only a few queues holding many URLs remain available at the > end. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin. That can reduce the time > needed to find an available queue, ensures that all queues are picked in turn, and, > if we use TopN in the generator, leaves no long-tail queues at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859299#comment-13859299 ] Tien Nguyen Manh commented on NUTCH-1687: - 1. It seems redundant in this context. 2. I added an id so that the queues map can delete a FetchItemQueue by its id quickly; otherwise we must traverse from the start of the queues. > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1687.patch > > > Currently we choose the queue to pick a URL from by starting at the head of the queues list, so queues at > the start of the list have more chance of being picked first. That can cause the problem > of long-tail queues, where only a few queues holding many URLs remain available at the > end. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin. That can reduce the time > needed to find an available queue, ensures that all queues are picked in turn, and, > if we use TopN in the generator, leaves no long-tail queues at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687.patch Added the Apache header; fixed the lost tail pointer when deleting. > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1687.patch > > > Currently we choose the queue to pick a URL from by starting at the head of the queues list, so queues at > the start of the list have more chance of being picked first. That can cause the problem > of long-tail queues, where only a few queues holding many URLs remain available at the > end. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin. That can reduce the time > needed to find an available queue, ensures that all queues are picked in turn, and, > if we use TopN in the generator, leaves no long-tail queues at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: (was: NUTCH-1687.patch) > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1687.patch > > > Currently we choose the queue to pick a URL from by starting at the head of the queues list, so queues at > the start of the list have more chance of being picked first. That can cause the problem > of long-tail queues, where only a few queues holding many URLs remain available at the > end. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin. That can reduce the time > needed to find an available queue, ensures that all queues are picked in turn, and, > if we use TopN in the generator, leaves no long-tail queues at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
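The round-robin selection proposed in NUTCH-1687 can be sketched as below. This is a simplified illustration under stated assumptions, not the attached patch: `HostQueue` and `RoundRobinQueues` are hypothetical stand-ins for the fetcher's queue classes, and politeness checks (crawl delay, in-progress limits) are left out entirely. It shows two points from the discussion: selection resumes after the queue served last time instead of restarting from the head, and exhausted queues are removed directly by id.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-host fetch queue; not Nutch's FetchItemQueue. */
class HostQueue {
    final String id;
    private final Deque<String> urls = new ArrayDeque<>();

    HostQueue(String id) { this.id = id; }

    void add(String url) { urls.addLast(url); }

    /** Returns the next URL, or null when the queue is empty. */
    String poll() { return urls.pollFirst(); }

    boolean isEmpty() { return urls.isEmpty(); }
}

/** Picks queues in round-robin order instead of always scanning from the first queue. */
class RoundRobinQueues {
    // Map keyed by queue id so a queue can be looked up and removed by id.
    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    // Rotation order: the id at the head is tried next.
    private final Deque<String> rotation = new ArrayDeque<>();

    synchronized void add(String queueId, String url) {
        HostQueue q = queues.get(queueId);
        if (q == null) {
            q = new HostQueue(queueId);
            queues.put(queueId, q);
            rotation.addLast(queueId);
        }
        q.add(url);
    }

    /** Returns the next URL, resuming after the queue served last time; null if all are empty. */
    synchronized String next() {
        int remaining = rotation.size();
        while (remaining-- > 0) {
            String id = rotation.pollFirst();
            HostQueue q = queues.get(id);
            String url = q.poll();
            if (q.isEmpty()) {
                queues.remove(id);    // drop the exhausted queue by id
            } else {
                rotation.addLast(id); // move to the back so other queues go first
            }
            if (url != null) {
                return url;
            }
        }
        return null;
    }
}
```

With one queue holding three URLs and another holding one, successive calls alternate between the two queues until the shorter one is exhausted, rather than draining the first queue before touching the second.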