[jira] [Commented] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171725#comment-15171725
 ] 

Tien Nguyen Manh commented on NUTCH-2236:
-

No problem, just to make it run on Hadoop 2.7.1

> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171264#comment-15171264
 ] 

Tien Nguyen Manh commented on NUTCH-2234:
-

elasticsearch 2.1.1 uses httpclient 4.3.6

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.





[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2236:

Attachment: NUTCH-2236.patch

I run Nutch 1.11 on Hadoop 2.7.1 with this patch.
We also need to add this line to etc/hadoop/mapred-env.sh:
export HADOOP_USER_CLASSPATH_FIRST=true


> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1





[jira] [Created] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2236:
---

 Summary: Upgrade to Hadoop 2.7.1
 Key: NUTCH-2236
 URL: https://issues.apache.org/jira/browse/NUTCH-2236
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.11
Reporter: Tien Nguyen Manh


Upgrade to Hadoop 2.7.1





[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: NUTCH-2234.patch

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.





[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: NUTCH-1687-2.patch

Here it is:
I updated my initial patch for version 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating 
a new iterator every time a new host is added, which happens quite frequently.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being picked 
> first. That can cause a long-tail problem, where only a few queues that still 
> have many URLs are left at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and 
> if we use TopN in the generator there is no long tail of queues at the end.
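
The round-robin idea in the description above can be sketched as follows. This is a minimal illustration, not the code from the attached patches: the class `RoundRobinQueues` and its methods are hypothetical names, and the sketch ignores the fetcher's politeness delays. It keeps a rotation of queue ids so that each call resumes after the queue served last time instead of restarting from the head of the list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Nutch's FetchItemQueues.
class RoundRobinQueues {
  // Queue id (for example a host name) -> pending URLs for that queue.
  private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
  // Rotation order of queue ids; the head is the next queue to serve.
  private final Deque<String> rotation = new ArrayDeque<>();

  public synchronized void add(String queueId, String url) {
    Deque<String> q = queues.get(queueId);
    if (q == null) {
      q = new ArrayDeque<>();
      queues.put(queueId, q);
      rotation.addLast(queueId); // a new queue joins the end of the rotation
    }
    q.addLast(url);
  }

  // Returns the next URL, or null when every queue is empty.
  public synchronized String next() {
    int remaining = rotation.size();
    while (remaining-- > 0) {
      String id = rotation.pollFirst();
      Deque<String> q = queues.get(id);
      String url = q.pollFirst();
      if (q.isEmpty()) {
        queues.remove(id);    // a drained queue leaves the rotation
      } else {
        rotation.addLast(id); // otherwise it moves to the back
      }
      if (url != null) {
        return url;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    RoundRobinQueues rr = new RoundRobinQueues();
    rr.add("a.example", "a1");
    rr.add("a.example", "a2");
    rr.add("a.example", "a3");
    rr.add("b.example", "b1");
    rr.add("c.example", "c1");
    List<String> order = new ArrayList<>();
    for (String u = rr.next(); u != null; u = rr.next()) {
      order.add(u);
    }
    System.out.println(order);
  }
}
```

Running `main` prints `[a1, b1, c1, a2, a3]`: the short queues are served early rather than waiting behind the long one, and adding a queue does not require building a new iterator.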





[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: (was: NUTCH-1687-2.patch)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being picked 
> first. That can cause a long-tail problem, where only a few queues that still 
> have many URLs are left at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and 
> if we use TopN in the generator there is no long tail of queues at the end.





[jira] [Issue Comment Deleted] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Comment: was deleted

(was: I updated my initial patch for ver 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating 
a new iterator every time a new host is added.)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being picked 
> first. That can cause a long-tail problem, where only a few queues that still 
> have many URLs are left at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and 
> if we use TopN in the generator there is no long tail of queues at the end.





[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: (was: NUTCH-2234.patch)

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.





[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: NUTCH-2234.patch

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.





[jira] [Created] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2234:
---

 Summary: Upgrade to elasticsearch 2.1.1
 Key: NUTCH-2234
 URL: https://issues.apache.org/jira/browse/NUTCH-2234
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.11
Reporter: Tien Nguyen Manh


Currently we use elasticsearch 1.x. We should upgrade to 2.x.





[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: NUTCH-1687-2.patch

I updated my initial patch for ver 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating 
a new iterator every time a new host is added.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being picked 
> first. That can cause a long-tail problem, where only a few queues that still 
> have many URLs are left at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and 
> if we use TopN in the generator there is no long tail of queues at the end.





[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2225:

Attachment: NUTCH-2225.patch

> Parsed time not include time to parse
> -
>
> Key: NUTCH-2225
> URL: https://issues.apache.org/jira/browse/NUTCH-2225
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tien Nguyen Manh
>Priority: Trivial
> Attachments: NUTCH-2225.patch
>
>
> In ParseSegment we report the parse time:
> LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
> But the start time is taken after we parse, so in the log we see many "0 ms" entries.





[jira] [Created] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2225:
---

 Summary: Parsed time not include time to parse
 Key: NUTCH-2225
 URL: https://issues.apache.org/jira/browse/NUTCH-2225
 Project: Nutch
  Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Trivial


In ParseSegment we report the parse time:
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
But the start time is taken after we parse, so in the log we see many "0 ms" entries.
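
The report above comes down to reading the clock before the work rather than after it. A minimal sketch of that pattern follows, assuming a hypothetical helper `elapsedMillis` and a short sleep standing in for the parse step; it is not the code from NUTCH-2225.patch.

```java
// Hypothetical illustration of timing a step; not Nutch's ParseSegment.
class ParseTiming {
  // Runs the work and returns the elapsed wall-clock time in milliseconds.
  static long elapsedMillis(Runnable work) {
    long start = System.currentTimeMillis(); // taken BEFORE the work starts
    work.run();
    long end = System.currentTimeMillis();
    return end - start;
  }

  // Stand-in for a parse step that takes some time.
  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    // Correct order: the measurement includes the work.
    long measured = elapsedMillis(() -> sleepQuietly(50));
    System.out.println("Parsed (" + measured + "ms)");

    // Order described in the report: the work finishes before the clock is
    // read, so the difference only covers the gap between the two reads.
    sleepQuietly(50);
    long lateStart = System.currentTimeMillis();
    long end = System.currentTimeMillis();
    System.out.println("Parsed (" + (end - lateStart) + "ms)");
  }
}
```

The second message will typically report a value at or near zero, which matches the "0 ms" log lines described in the issue.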






[jira] [Updated] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2224:

Attachment: NUTCH-2224.patch

> Wrong metric compute in Fetcher status report
> -
>
> Key: NUTCH-2224
> URL: https://issues.apache.org/jira/browse/NUTCH-2224
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tien Nguyen Manh
>Priority: Trivial
> Attachments: NUTCH-2224.patch
>
>
> Currently we convert from bytes to kbits by
> (bytes.get() / 125l)
> I think it should be /128l





[jira] [Created] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2224:
---

 Summary: Wrong metric compute in Fetcher status report
 Key: NUTCH-2224
 URL: https://issues.apache.org/jira/browse/NUTCH-2224
 Project: Nutch
  Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Trivial


Currently we convert from bytes to kbits by
(bytes.get() / 125l)
I think it should be /128l
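
For reference, the two divisors correspond to two different units, and the report does not say which one the status line is meant to show. One byte is 8 bits, so 125 bytes are 1000 bits (one SI kilobit) and 128 bytes are 1024 bits (one binary kibibit). A small sketch of the arithmetic, with hypothetical helper names:

```java
// Hypothetical helpers showing the arithmetic behind the two divisors.
class BitUnits {
  // 1 kilobit (SI) = 1000 bits = 125 bytes.
  static long bytesToKilobits(long bytes) {
    return bytes / 125L;
  }

  // 1 kibibit (binary) = 1024 bits = 128 bytes.
  static long bytesToKibibits(long bytes) {
    return bytes / 128L;
  }

  public static void main(String[] args) {
    long bytes = 1_000_000L;
    System.out.println(bytesToKilobits(bytes)); // 8000
    System.out.println(bytesToKibibits(bytes)); // 7812 (integer division of 7812.5)
  }
}
```

Both conversions use integer division, as the expression in the report does, so fractions are truncated.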





[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2223:

Attachment: NUTCH-2223.patch

Patch for nutch 1.11

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> 
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-2223.patch
>
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)





[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2223:

Fix Version/s: 1.13

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> 
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Priority: Minor
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)





[jira] [Created] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2223:
---

 Summary: Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika 
mimetype detection
 Key: NUTCH-2223
 URL: https://issues.apache.org/jira/browse/NUTCH-2223
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.11
Reporter: Tien Nguyen Manh
Priority: Minor


Stack trace for the hang seems to be:
at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)





[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2223:

Fix Version/s: (was: 1.13)

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> 
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Priority: Minor
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117020#comment-15117020
 ] 

Tien Nguyen Manh commented on NUTCH-961:


Can NUTCH-1233 (use Tika to extract outlinks) solve that problem?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.





[jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116772#comment-15116772
 ] 

Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM:
-

Ah yes, could you explain why we need to parse it twice? With NUTCH-1233, can we 
use just one parse?


was (Author: tiennm):
Ah yes, could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116772#comment-15116772
 ] 

Tien Nguyen Manh commented on NUTCH-961:


Ah yes, could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-24 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114658#comment-15114658
 ] 

Tien Nguyen Manh commented on NUTCH-961:


One note on boilerpipe support: it is significantly slower than parse-html. I 
parsed the same segment with each and here are the results:
parse-html: 3hm, parse-tika with boilerpipe: 5h10m, and parse-tika without 
boilerpipe: 4h.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-20 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110217#comment-15110217
 ] 

Tien Nguyen Manh commented on NUTCH-961:


I'm using the patch NUTCH-961-1.11-1.patch. It works fine when run from 
Eclipse and when run in Hadoop, but it has a problem when I run in local mode:
it throws the exception "Can't retrieve Tika parser for mime-type text/html". It 
is not a problem with parse-plugins.xml. It seems to be a problem with the TikaConfig 
constructor TikaConfig(ClassLoader loader), which fails to load some config via the 
class loader when run in local mode.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.





[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-23 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1679:

Attachment: NUTCH-1679-2.patch

I have another solution.
For a new link in DbUpdateReducer we only add the URL, with no status, fetch time, or 
any other info. 
 - If the link already exists in the database, we don't override anything. 
 - Otherwise it really is a new link: it will have status = 0 (the default 
value) and we initialize it (set status, fetch time, ...) in the Generator 
instead.
I tested it with the HBase backend on nutch-2.3.

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3.1

 Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch


 The problem is in the HBase store; I am not sure about other stores.
 Suppose in the first crawl cycle we crawl link A and get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store and add A as a new link, 
 because updatedb doesn't know A already exists in the store, so it overrides A.
 To avoid this, UpdateDb must be run without batchId, or we must set additionsAllowed=false.
 Here is the code for a new page:
    page = new WebPage();
    schedule.initializeSchedule(url, page);
    page.setStatus(CrawlStatus.STATUS_UNFETCHED);
    try {
      scoringFilters.initialScore(url, page);
    } catch (ScoringFilterException e) {
      page.setScore(0.0f);
    }
 The new page will override the old page's status, score, fetchTime, fetchInterval, 
 retries, and metadata[CASH_KEY].
  - I think we can change something here so that a new page only updates one 
 column, for example 'link', and if it really is a new page, we can initialize 
 all of the above fields in the generator;
  - or we add a checkAndPut operator to the store, so that when adding a new page we 
 check first whether it already exists.
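
The second option above, writing a new page only when no row exists yet, can be sketched with an in-memory map standing in for the store. Everything here is an assumption for illustration: the class `CheckAndPutSketch`, its methods, and the string statuses are hypothetical, and a real backend would need its own atomic check-and-put operation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the web page store (url -> status).
class CheckAndPutSketch {
  private final Map<String, String> store = new ConcurrentHashMap<>();

  // Unconditional write: overwrites whatever is already stored for the URL,
  // which is the behaviour the issue describes.
  void put(String url, String status) {
    store.put(url, status);
  }

  // Conditional write: stores the row only if the URL is not present yet.
  // Returns true when a new row was created.
  boolean putIfNew(String url, String status) {
    return store.putIfAbsent(url, status) == null;
  }

  String status(String url) {
    return store.get(url);
  }

  public static void main(String[] args) {
    CheckAndPutSketch db = new CheckAndPutSketch();
    db.put("A", "fetched");                         // cycle 1: A was crawled
    boolean created = db.putIfNew("A", "unfetched"); // cycle 2: B links back to A
    System.out.println(created + " " + db.status("A"));
    System.out.println(db.putIfNew("C", "unfetched") + " " + db.status("C"));
  }
}
```

With the conditional write, the outlink from B back to A leaves A's stored status untouched, while a link to a URL that is not in the store yet still creates a new row.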





[jira] [Created] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1702:
---

 Summary: Port HostNormalizer to 2.x
 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh


Port NUTCH-1319 to 2.x



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: NUTCH-1702.patch

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Fix Version/s: 2.3

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: NUTCH-1702.patch

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: (was: NUTCH-1702.patch)

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1704:


Attachment: NUTCH-1704.patch

 Port DomainBlacklist urlfilter to 2.x
 -

 Key: NUTCH-1704
 URL: https://issues.apache.org/jira/browse/NUTCH-1704
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Attachments: NUTCH-1704.patch


 Port NUTCH-1210 to 2.x





[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1478:


Attachment: NUTCH-1478-parse-v2.patch

I ported parse-metatags to 2.x; this patch supports multi-valued metatags.

 Parse-metatags and index-metadata plugin for Nutch 2.x series 
 --

 Key: NUTCH-1478
 URL: https://issues.apache.org/jira/browse/NUTCH-1478
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.1
Reporter: kiran
 Fix For: 2.3

 Attachments: NUTCH-1478-parse-v2.patch, Nutch1478.patch, 
 Nutch1478.zip, metadata_parseChecker_sites.png


 I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series.  
 They take multiple values of the same tag and index them in Solr, as I patched 
 before (https://issues.apache.org/jira/browse/NUTCH-1467).
 The usage is the same as described here 
 (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is 
 no need to put the 'metatag' keyword before metatag names. For example, my 
 configuration looks like this 
 (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
  
 This is only the first version and does not include the JUnit test. I will 
 upload a new version soon.
 This will parse the tags and index them in Solr. Make sure the fields listed in 
 'index.parse.md' in nutch-site.xml are also created in schema.xml in Solr.
 Please let me know if you have any suggestions.
 This is supported by DLA (Digital Library and Archives) of Virginia Tech.





[jira] [Updated] (NUTCH-1705) Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1705:


Attachment: NUTCH-1705.patch

 Make configuration option for HtmlParser & TikaParser to extract text or 
 title for noIndex page
 ---

 Key: NUTCH-1705
 URL: https://issues.apache.org/jira/browse/NUTCH-1705
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor
 Attachments: NUTCH-1705.patch


 Currently HtmlParser and TikaParser always skip extracting text and title for 
 noIndex pages, i.e. pages that have a noindex robots metatag.
 But some parse filters may still be interested in the text and title, such as 
 NUTCH-1661, where we may decide whether to follow a page based on its language.





[jira] [Created] (NUTCH-1705) Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1705:
---

 Summary: Make configuration option for HtmlParser & TikaParser to 
extract text or title for noIndex page
 Key: NUTCH-1705
 URL: https://issues.apache.org/jira/browse/NUTCH-1705
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor


Currently HtmlParser and TikaParser always skip extracting text and title for 
noIndex pages, i.e. pages that have a noindex robots metatag.
But some parse filters may still be interested in the text and title, such as 
NUTCH-1661, where we may decide whether to follow a page based on its language.





[jira] [Created] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1701:
---

 Summary: Make Solr Document Boost as an option
 Key: NUTCH-1701
 URL: https://issues.apache.org/jira/browse/NUTCH-1701
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Tien Nguyen Manh
Priority: Minor


Nutch SolrIndexer uses the Nutch score as the document boost by default. We should make 
this optional, because the Nutch score can be used for boosting in other ways, such 
as at query time via a function query.
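
A minimal sketch of what such an option could look like, assuming a boolean setting that defaults to the current behaviour. The class `BoostPolicy` and the property key `indexer.score.as.boost` are hypothetical names chosen for illustration; they are not existing Nutch settings, and `java.util.Properties` stands in for the Hadoop configuration.

```java
import java.util.Properties;

// Hypothetical helper; the property key below is an assumption, not an existing Nutch setting.
final class BoostPolicy {
  static final String USE_SCORE_AS_BOOST = "indexer.score.as.boost"; // hypothetical key

  private final boolean useScoreAsBoost;

  BoostPolicy(Properties conf) {
    // Default to true so the current behaviour is kept unless the option is turned off.
    this.useScoreAsBoost =
        Boolean.parseBoolean(conf.getProperty(USE_SCORE_AS_BOOST, "true"));
  }

  // Returns the boost to apply to a document: the Nutch score when enabled,
  // otherwise a neutral boost of 1.0.
  float documentBoost(float nutchScore) {
    return useScoreAsBoost ? nutchScore : 1.0f;
  }
}
```

When the option is off, the score can still be indexed as an ordinary field and applied at query time instead.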





[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1701:


Fix Version/s: 1.8
   2.3

 Make Solr Document Boost as an option
 -

 Key: NUTCH-1701
 URL: https://issues.apache.org/jira/browse/NUTCH-1701
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1701-2x.patch


 Nutch SolrIndexer uses the Nutch score as the document boost by default. We should 
 make this optional, because the Nutch score can be used for boosting in other ways, 
 such as at query time via a function query.





[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1701:


Attachment: NUTCH-1701-2x.patch

 Make Solr Document Boost as an option
 -

 Key: NUTCH-1701
 URL: https://issues.apache.org/jira/browse/NUTCH-1701
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1701-2x.patch


 Nutch SolrIndexer uses the Nutch score as the document boost by default. We should 
 make this optional, because the Nutch score can be used for boosting in other ways, 
 such as at query time via a function query.





[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2014-01-02 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
 ] 

Tien Nguyen Manh commented on NUTCH-1686:
-

In this patch I also fixed a bug with fetchTime. Currently, each time we run 
updatedb over the whole table, fetchTime is increased again for all URLs.

 Optimize UpdateDb to load less field from Store
 ---

 Key: NUTCH-1686
 URL: https://issues.apache.org/jira/browse/NUTCH-1686
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1686.patch


 While running a large crawl I found that updatedb runs very slowly, especially the 
 Map task, which loads data from the store.
 We can't filter by batchId to load fewer URLs because of the bug in NUTCH-1679, so 
 we must always update the whole table.
 After checking the fields loaded in UpdateDbJob, I found that it loads many 
 fields from the store (at least 15 of 25 fields), which makes updatedb slow.
 I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS and 
 METADATA, which are used to compute link score and distance. I think that is the 
 main purpose of this job.
 The other fields are used to compute the URL schedule for the parser and fetcher. We 
 can move that code to Parser or Fetcher without loading many new fields, because 
 many of the fields are generated by the parser. We can also use a Gora filter for 
 Fetcher or Parser, so loading new fields is not a problem.
 I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE, which are 
 currently stored in METADATA. CASH is used by OPICScoring, which is used 
 only in UpdateDb, and DISTANCE is used only in Generator and Updater, so moving 
 both to a new metadata field avoids reading METADATA in Generator 
 and Updater. METADATA contains a lot of data that is used only by Parser and 
 Indexer.
 So with the new change:
 UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS and MARKERS; we 
 don't need to load the big Fetch family and INLINKS.
 Generator only loads SCOREMETA (which is smaller than the current METADATA).
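
The proposed per-job field selection can be sketched as below. This is an illustration only: `PageField` and `JobFields` are hypothetical names, the enum lists just a handful of fields, and the real WebPage fields in Nutch 2.x are defined by its Gora schema.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, abbreviated field list; not the generated WebPage.Field enum.
enum PageField { SCORE, SCOREMETA, OUTLINKS, MARKERS, INLINKS, METADATA, CONTENT, TEXT, FETCH_TIME }

// Hypothetical holder for the field sets each job asks the store to load.
final class JobFields {
  // Fields the description proposes UpdateDb should load.
  static final Set<PageField> UPDATE_DB = Collections.unmodifiableSet(
      EnumSet.of(PageField.SCORE, PageField.SCOREMETA, PageField.OUTLINKS, PageField.MARKERS));

  // Scoring-related field the description proposes Generator should load
  // instead of the full METADATA map.
  static final Set<PageField> GENERATOR = Collections.unmodifiableSet(
      EnumSet.of(PageField.SCOREMETA));

  // Converts a field set to the array-of-names form that a store query would typically take.
  static String[] names(Set<PageField> fields) {
    String[] out = new String[fields.size()];
    int i = 0;
    for (PageField f : fields) {
      out[i++] = f.name();
    }
    return out;
  }
}
```

Keeping the sets in one place makes it easy to check that a job never requests a large column such as INLINKS or METADATA by accident.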





[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Issue Type: New Feature  (was: Bug)

 TextMD5Signatue compute on textual content
 --

 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: New Feature
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1693.patch


 I created a new MD5Signature that is based on textual content. In our case we use 
 boilerpipe to extract the main text from the content, so this signature is more 
 effective for deduplication.
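
The idea of a text-based signature can be sketched as follows. This is not the class from the attached patch: `TextSignature` is a hypothetical helper that simply hashes the already extracted text with MD5 via the standard `java.security.MessageDigest` API, and it assumes that an empty string is used when no text was extracted.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the class from the attached patch.
final class TextSignature {
  // Returns the MD5 digest of the extracted text.
  // Assumption: missing text is treated as the empty string.
  static byte[] of(String extractedText) {
    String text = extractedText == null ? "" : extractedText;
    try {
      return MessageDigest.getInstance("MD5").digest(text.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required to be present on every Java platform.
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // Renders a digest as lowercase hexadecimal for logging or comparison.
  static String hex(byte[] digest) {
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
```

Two pages whose raw HTML differs only in navigation or advertising markup would get the same signature here, provided the text extractor returns the same main text for both.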





[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Fix Version/s: 2.3

 TextMD5Signatue compute on textual content
 --

 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: New Feature
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1693.patch


 I created a new MD5Signature that is based on textual content. In our case we use 
 boilerpipe to extract the main text from the content, so this signature is more 
 effective for deduplication.





[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195
 ] 

Tien Nguyen Manh commented on NUTCH-1693:
-

This patch only works with a minor change that computes the signature after setting 
the text on the page, which I made in NUTCH-1686.

 TextMD5Signatue compute on textual content
 --

 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: New Feature
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1693.patch


 I created a new MD5Signature that is based on textual content. In our case we use 
 boilerpipe to extract the main text from the content, so this signature is more 
 effective for deduplication.





[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859364#comment-13859364
 ] 

Tien Nguyen Manh commented on NUTCH-1687:
-

It is nice!

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch


 Currently we choose the queue to pick a URL from by starting at the head of the 
 queue list, so queues at the start of the list have more chance of being picked 
 first. That can cause the long-tail queue problem, where only a few queues with 
 many URLs remain available at the end.
 public synchronized FetchItem getFetchItem() {
   final Iterator<Map.Entry<String, FetchItemQueue>> it =
     queues.entrySet().iterator(); // <== always reset to find queue from start
   while (it.hasNext()) {
 
 I think it is better to pick queues in round robin. That can reduce the time 
 needed to find an available queue, makes sure all queues are picked in turn and, 
 if we use TopN during generate, leaves no long-tail queue at the end.
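
A minimal sketch of round-robin selection, assuming a fixed list of per-host queues. `RoundRobinPicker` is a hypothetical, simplified class and not the FetchItemQueues implementation from Nutch or from the attached patches; it ignores politeness delays and in-progress limits, and only shows how remembering the last position avoids restarting from the first queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical simplified picker; not the FetchItemQueues class from Nutch.
final class RoundRobinPicker {
  private final List<Deque<String>> queues = new ArrayList<>();
  private int next = 0; // index of the queue to try first on the next call

  synchronized void addQueue(List<String> urls) {
    queues.add(new ArrayDeque<>(urls));
  }

  // Returns the next URL, starting the search where the previous call stopped,
  // or null when every queue is empty.
  synchronized String getFetchItem() {
    for (int tried = 0; tried < queues.size(); tried++) {
      int i = (next + tried) % queues.size();
      String url = queues.get(i).poll();
      if (url != null) {
        next = (i + 1) % queues.size(); // resume after the queue just served
        return url;
      }
    }
    return null;
  }
}
```

With three queues holding 3, 1 and 2 URLs, this serves one URL from each queue in turn and skips a queue only once it is empty, so no single large queue is left to drain alone until the others are exhausted.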





[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:


Attachment: NUTCH-1687.patch

Added the Apache header.
Fixed the lost tail pointer when deleting.

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1687.patch


 Currently we choose the queue to pick a URL from by starting at the head of the 
 queue list, so queues at the start of the list have more chance of being picked 
 first. That can cause the long-tail queue problem, where only a few queues with 
 many URLs remain available at the end.
 public synchronized FetchItem getFetchItem() {
   final Iterator<Map.Entry<String, FetchItemQueue>> it =
     queues.entrySet().iterator(); // <== always reset to find queue from start
   while (it.hasNext()) {
 
 I think it is better to pick queues in round robin. That can reduce the time 
 needed to find an available queue, makes sure all queues are picked in turn and, 
 if we use TopN during generate, leaves no long-tail queue at the end.





[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:


Attachment: (was: NUTCH-1687.patch)

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1687.patch


 Currently we choose the queue to pick a URL from by starting at the head of the 
 queue list, so queues at the start of the list have more chance of being picked 
 first. That can cause the long-tail queue problem, where only a few queues with 
 many URLs remain available at the end.
 public synchronized FetchItem getFetchItem() {
   final Iterator<Map.Entry<String, FetchItemQueue>> it =
     queues.entrySet().iterator(); // <== always reset to find queue from start
   while (it.hasNext()) {
 
 I think it is better to pick queues in round robin. That can reduce the time 
 needed to find an available queue, makes sure all queues are picked in turn and, 
 if we use TopN during generate, leaves no long-tail queue at the end.


