
[jira] [Commented] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171725#comment-15171725
 ] 

Tien Nguyen Manh commented on NUTCH-2236:
-

No problem, just to make it run on Hadoop 2.7.1

> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171264#comment-15171264
 ] 

Tien Nguyen Manh commented on NUTCH-2234:
-

elasticsearch 2.1.1 uses httpclient 4.3.6.

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.




[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2236:

Attachment: NUTCH-2236.patch

I ran Nutch 1.11 on Hadoop 2.7.1 with this patch.
We also need to add this line to etc/hadoop/mapred-env.sh:
export HADOOP_USER_CLASSPATH_FIRST=true


> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1




[jira] [Created] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-2236:
---

 Summary: Upgrade to Hadoop 2.7.1
 Key: NUTCH-2236
 URL: https://issues.apache.org/jira/browse/NUTCH-2236
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.11
Reporter: Tien Nguyen Manh


Upgrade to Hadoop 2.7.1




[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: NUTCH-2234.patch

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.




[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: NUTCH-1687-2.patch

Here it is:
I updated my initial patch for version 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating 
a new iterator every time a new host is added, which happens quite frequently.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being 
> picked first. That can cause a long-tail problem, where at the end only a few 
> queues remain available and they still hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and if we use TopN 
> during generation there is no long tail of queues at the end.
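
The round-robin idea in the description above can be sketched roughly as follows. This is only an illustration, not the attached patch: `HostQueue` and `RoundRobinQueues` are hypothetical names, and the sketch rotates a deque rather than using the circular linked list mentioned in the comment. The point is that the search resumes where the last pick stopped instead of restarting at the head.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-host fetch queue; not the Nutch class. */
class HostQueue {
  final String host;
  final Deque<String> urls = new ArrayDeque<>();

  HostQueue(String host) {
    this.host = host;
  }
}

/** Picks the next URL by rotating over host queues instead of restarting at the head. */
class RoundRobinQueues {
  // Host queues in rotation order: the head is the next queue to try.
  private final Deque<HostQueue> rotation = new ArrayDeque<>();
  private final Map<String, HostQueue> byHost = new HashMap<>();

  synchronized void add(String host, String url) {
    HostQueue q = byHost.get(host);
    if (q == null) {
      q = new HostQueue(host);
      byHost.put(host, q);
      rotation.addLast(q); // new hosts join at the tail; no iterator is rebuilt
    }
    q.urls.addLast(url);
  }

  /** Returns the next URL, or null when every queue is empty. */
  synchronized String next() {
    int n = rotation.size();
    for (int i = 0; i < n; i++) {
      HostQueue q = rotation.pollFirst();
      String url = q.urls.pollFirst();
      if (url != null) {
        rotation.addLast(q); // move to the tail so other hosts get their turn
        return url;
      }
      byHost.remove(q.host); // drained queue: drop it from the rotation
    }
    return null;
  }
}
```

With three URLs queued for one host and one URL for a second host, `next()` alternates between the two hosts until the shorter queue is drained, so no host is starved by its position in the list.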




[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: (was: NUTCH-1687-2.patch)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being 
> picked first. That can cause a long-tail problem, where at the end only a few 
> queues remain available and they still hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and if we use TopN 
> during generation there is no long tail of queues at the end.




[jira] [Issue Comment Deleted] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Comment: was deleted

(was: I updated my initial patch for ver 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating 
a new iterator every time a new host is added.)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being 
> picked first. That can cause a long-tail problem, where at the end only a few 
> queues remain available and they still hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and if we use TopN 
> during generation there is no long tail of queues at the end.




[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: (was: NUTCH-2234.patch)

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.




[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: NUTCH-2234.patch

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.




[jira] [Created] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-2234:
---

 Summary: Upgrade to elasticsearch 2.1.1
 Key: NUTCH-2234
 URL: https://issues.apache.org/jira/browse/NUTCH-2234
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.11
Reporter: Tien Nguyen Manh


Currently we use elasticsearch 1.x. We should upgrade to 2.x.




[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: NUTCH-1687-2.patch

I updated my initial patch for ver 1.11.
I crawl a large number of hosts, so using a circular linked list avoids creating 
a new iterator every time a new host is added.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the 
> queues list, so queues at the start of the list have more chance of being 
> picked first. That can cause a long-tail problem, where at the end only a few 
> queues remain available and they still hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to search from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> to find an available queue and gives every queue its turn, and if we use TopN 
> during generation there is no long tail of queues at the end.




[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2225:

Attachment: NUTCH-2225.patch

> Parsed time not include time to parse
> -
>
> Key: NUTCH-2225
> URL: https://issues.apache.org/jira/browse/NUTCH-2225
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tien Nguyen Manh
>Priority: Trivial
> Attachments: NUTCH-2225.patch
>
>
> In ParseSegment we report the parse time:
> LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
> But the start time is taken after parsing, so the log shows many "0 ms" entries.
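
A minimal sketch of the timing issue described above, assuming nothing about the actual ParseSegment code: `ParseTimer`, `timeMillis` and `lateStartMillis` are hypothetical names used only for illustration. The first method samples the clock before the work starts; the second samples it afterwards, which is the pattern that would produce near-zero durations.

```java
class ParseTimer {
  /** Samples the clock before the task runs, so the result includes the work. */
  static long timeMillis(Runnable task) {
    long start = System.nanoTime();
    task.run();
    long end = System.nanoTime();
    return (end - start) / 1_000_000L;
  }

  /** Samples the start time after the task has run, so the work is not measured. */
  static long lateStartMillis(Runnable task) {
    task.run();
    long start = System.nanoTime(); // too late: the task has already finished
    long end = System.nanoTime();
    return (end - start) / 1_000_000L;
  }
}
```

Passing the same slow task to both methods should show the difference: only `timeMillis` reports a duration close to the real running time.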




[jira] [Created] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-2225:
---

 Summary: Parsed time not include time to parse
 Key: NUTCH-2225
 URL: https://issues.apache.org/jira/browse/NUTCH-2225
 Project: Nutch
  Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Trivial


In ParseSegment we report the parse time:
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
But the start time is taken after parsing, so the log shows many "0 ms" entries.





[jira] [Updated] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2224:

Attachment: NUTCH-2224.patch

> Wrong metric compute in Fetcher status report
> -
>
> Key: NUTCH-2224
> URL: https://issues.apache.org/jira/browse/NUTCH-2224
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tien Nguyen Manh
>Priority: Trivial
> Attachments: NUTCH-2224.patch
>
>
> Currently we convert from bytes to kbits with
> (bytes.get() / 125l)
> I think it should be /128l
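
For reference, the two divisors correspond to two different units, so which one is right depends on what "kbits" is meant to mean. A small worked sketch (the class and method names are hypothetical, not from the Fetcher code):

```java
class Throughput {
  /** Decimal kilobits (1 kbit = 1000 bits): bytes * 8 / 1000 == bytes / 125. */
  static long toKilobits(long bytes) {
    return bytes / 125L;
  }

  /** Binary kibibits (1 Kibit = 1024 bits): bytes * 8 / 1024 == bytes / 128. */
  static long toKibibits(long bytes) {
    return bytes / 128L;
  }
}
```

For 1,000,000 bytes the decimal conversion gives 8000 kbit and the binary one gives 7812 Kibit, a difference of about 2.4 percent.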




[jira] [Created] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-2224:
---

 Summary: Wrong metric compute in Fetcher status report
 Key: NUTCH-2224
 URL: https://issues.apache.org/jira/browse/NUTCH-2224
 Project: Nutch
  Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Trivial


Currently we convert from bytes to kbits with
(bytes.get() / 125l)
I think it should be /128l




[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2223:

Attachment: NUTCH-2223.patch

Patch for Nutch 1.11

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> 
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-2223.patch
>
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)




[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2223:

Fix Version/s: 1.13

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> 
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Priority: Minor
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)




[jira] [Created] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-2223:
---

 Summary: Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika 
mimetype detection
 Key: NUTCH-2223
 URL: https://issues.apache.org/jira/browse/NUTCH-2223
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.11
Reporter: Tien Nguyen Manh
Priority: Minor


Stack trace for the hang seems to be:
at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)




[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2223:

Fix Version/s: (was: 1.13)

> Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
> 
>
> Key: NUTCH-2223
> URL: https://issues.apache.org/jira/browse/NUTCH-2223
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Priority: Minor
>
> Stack trace for the hang seems to be:
> at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:54)
> at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:41)
> at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:192)
> at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:439)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:252)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117020#comment-15117020
 ] 

Tien Nguyen Manh commented on NUTCH-961:


Can NUTCH-1233 (use Tika to extract outlinks) solve that problem?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.




[jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772
 ] 

Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM:
-

Ah yes. Could you explain why we need to parse it twice? With NUTCH-1233, can we 
use just one parse?


was (Author: tiennm):
Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772
 ] 

Tien Nguyen Manh commented on NUTCH-961:


Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-24 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114658#comment-15114658
 ] 

Tien Nguyen Manh commented on NUTCH-961:


One note on boilerpipe support: it is significantly slower than parse-html. I 
parsed the same segment with each parser and here are the results:
parse-html: 3hm, parse-tika with boilerpipe: 5h10m, and parse-tika without 
boilerpipe: 4h.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-20 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110217#comment-15110217
 ] 

Tien Nguyen Manh commented on NUTCH-961:


I'm using the patch NUTCH-961-1.11-1.patch. It works fine when run from 
Eclipse and in Hadoop, but it has a problem when I run in local mode: 
it throws the exception "Can't retrieve Tika parser for mime-type text/html". It 
is not a problem with parse-plugins.xml. It seems to be a problem with the 
TikaConfig constructor TikaConfig(ClassLoader loader), which fails to load some 
config via the class loader when run in local mode.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerplate in the Nutch configuration.




[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-23 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1679:

Attachment: NUTCH-1679-2.patch

I have another solution.
For a new link in DbUpdaterReducer we only add the URL, with no status, fetch 
time, or any other info. 
 - If the link already exists in the database, we don't override anything. 
 - Otherwise it is actually a new link; it will have status = 0 (the default 
value), and we initialize it (set status, fetch time, ...) in Generator 
instead.
I tested it with the HBase backend on nutch-2.3.

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>
> The problem is in the HBase store; I am not sure about other stores.
> Suppose in the first crawl cycle we crawl link A and get an outlink B.
> In the second cycle we crawl link B, which also has a link pointing to A.
> In the second updatedb we load only page B from the store and add A as a new 
> link, because updatedb doesn't know A already exists in the store, so it 
> overrides A.
> UpdateDb must be run without batchId, or we must set additionsAllowed=false.
> Here is the code for a new page:
>   page = new WebPage();
>   schedule.initializeSchedule(url, page);
>   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
>   try {
>     scoringFilters.initialScore(url, page);
>   } catch (ScoringFilterException e) {
>     page.setScore(0.0f);
>   }
> The new page will override the old page's status, score, fetchTime, 
> fetchInterval, retries, metadata[CASH_KEY]
>  - I think we can change something here so that a new page only updates one 
> column, for example 'link', and if it really is a new page, we can initialize 
> all of the above fields in the generator
>  - or we add a checkAndPut operator to the store, so that when adding a new 
> page we first check whether it already exists
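
The second option above (check before writing) can be illustrated with a minimal in-memory sketch. This is an assumption-laden stand-in, not Nutch or HBase code: `PageRecord` and `LinkStore` are hypothetical names, and a `ConcurrentHashMap` plays the role of the store. It only shows the semantics: an unconditional put lets a newly discovered link overwrite a crawled page, while a put-if-absent leaves the existing row alone.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified page row; not the Nutch WebPage class. */
class PageRecord {
  final int status;
  final float score;

  PageRecord(int status, float score) {
    this.status = status;
    this.score = score;
  }
}

/** Hypothetical in-memory stand-in for the backing store. */
class LinkStore {
  private final Map<String, PageRecord> rows = new ConcurrentHashMap<>();

  /** Unconditional write: a newly discovered link replaces whatever is stored. */
  void put(String url, PageRecord page) {
    rows.put(url, page);
  }

  /** Check-and-put: writes only when no row exists; returns true if a row was created. */
  boolean putIfNew(String url, PageRecord page) {
    return rows.putIfAbsent(url, page) == null;
  }

  PageRecord get(String url) {
    return rows.get(url);
  }
}
```

In the A/B scenario from the description, calling `putIfNew` for A during the second updatedb would return false and keep the status and score that A received when it was crawled.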




[jira] [Created] (NUTCH-1705) Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-1705:
---

 Summary: Make configuration option for HtmlParser & TikaParser to 
extract text or title for noIndex page
 Key: NUTCH-1705
 URL: https://issues.apache.org/jira/browse/NUTCH-1705
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor


Currently HtmlParser and TikaParser always skip extracting text and title for 
noIndex pages (pages which have a noIndex robots metatag).
But some parse filters may still be interested in the text and title, as in 
NUTCH-1661, where we may decide whether to follow a page based on its language.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (NUTCH-1705) Make configuration option for HtmlParser & TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1705:


Attachment: NUTCH-1705.patch

> Make configuration option for HtmlParser & TikaParser to extract text or 
> title for noIndex page
> ---
>
> Key: NUTCH-1705
> URL: https://issues.apache.org/jira/browse/NUTCH-1705
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1705.patch
>
>
> Currently HtmlParser and TikaParser always skip extracting text and title for 
> noindex pages, i.e. pages that carry a noindex robots meta tag.
> But some parse filters may still be interested in the text and title, as in 
> NUTCH-1661, where we may decide whether to follow a page based on its language.




[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1478:


Attachment: NUTCH-1478-parse-v2.patch

I ported parse-metatags to 2.x; this patch supports multi-valued metatags.

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --
>
> Key: NUTCH-1478
> URL: https://issues.apache.org/jira/browse/NUTCH-1478
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.1
>Reporter: kiran
> Fix For: 2.3
>
> Attachments: NUTCH-1478-parse-v2.patch, Nutch1478.patch, 
> Nutch1478.zip, metadata_parseChecker_sites.png
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series.  
> They take multiple values of the same tag and index them in Solr, as I patched 
> before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here 
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is 
> no need to put the 'metatag' keyword before metatag names. For example, my 
> configuration looks like this 
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>  
> This is only the first version and does not include the JUnit test. I will 
> update it with a new version soon.
> The plugins parse the tags and index them in Solr. Make sure the fields listed 
> in 'index.parse.md' in nutch-site.xml are also defined in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.




[jira] [Created] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-1704:
---

 Summary: Port DomainBlacklist urlfilter to 2.x
 Key: NUTCH-1704
 URL: https://issues.apache.org/jira/browse/NUTCH-1704
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Attachments: NUTCH-1704.patch

Port NUTCH-1210 to 2.x




[jira] [Updated] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1704:


Attachment: NUTCH-1704.patch

> Port DomainBlacklist urlfilter to 2.x
> -
>
> Key: NUTCH-1704
> URL: https://issues.apache.org/jira/browse/NUTCH-1704
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-1704.patch
>
>
> Port NUTCH-1210 to 2.x




[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: (was: NUTCH-1702.patch)

> Port HostNormalizer to 2.x
> --
>
> Key: NUTCH-1702
> URL: https://issues.apache.org/jira/browse/NUTCH-1702
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Tien Nguyen Manh
> Fix For: 2.3
>
> Attachments: NUTCH-1702.patch
>
>
> Port NUTCH-1319 to 2.x




[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: NUTCH-1702.patch

> Port HostNormalizer to 2.x
> --
>
> Key: NUTCH-1702
> URL: https://issues.apache.org/jira/browse/NUTCH-1702
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Tien Nguyen Manh
> Fix For: 2.3
>
> Attachments: NUTCH-1702.patch
>
>
> Port NUTCH-1319 to 2.x




[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Fix Version/s: 2.3

> Port HostNormalizer to 2.x
> --
>
> Key: NUTCH-1702
> URL: https://issues.apache.org/jira/browse/NUTCH-1702
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Tien Nguyen Manh
> Fix For: 2.3
>
> Attachments: NUTCH-1702.patch
>
>
> Port NUTCH-1319 to 2.x




[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: NUTCH-1702.patch

> Port HostNormalizer to 2.x
> --
>
> Key: NUTCH-1702
> URL: https://issues.apache.org/jira/browse/NUTCH-1702
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Tien Nguyen Manh
> Fix For: 2.3
>
> Attachments: NUTCH-1702.patch
>
>
> Port NUTCH-1319 to 2.x




[jira] [Created] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-1702:
---

 Summary: Port HostNormalizer to 2.x
 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh


Port NUTCH-1319 to 2.x




[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1701:


Attachment: NUTCH-1701-2x.patch

> Make Solr Document Boost as an option
> -
>
> Key: NUTCH-1701
> URL: https://issues.apache.org/jira/browse/NUTCH-1701
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1701-2x.patch
>
>
> Nutch SolrIndexer uses the Nutch score as the document boost by default. We should 
> make this an option, because the Nutch score can be used for boosting in other ways, 
> such as boosting at query time via a function query.




[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1701:


Fix Version/s: 1.8
   2.3

> Make Solr Document Boost as an option
> -
>
> Key: NUTCH-1701
> URL: https://issues.apache.org/jira/browse/NUTCH-1701
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1701-2x.patch
>
>
> Nutch SolrIndexer uses the Nutch score as the document boost by default. We should 
> make this an option, because the Nutch score can be used for boosting in other ways, 
> such as boosting at query time via a function query.




[jira] [Created] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-1701:
---

 Summary: Make Solr Document Boost as an option
 Key: NUTCH-1701
 URL: https://issues.apache.org/jira/browse/NUTCH-1701
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Tien Nguyen Manh
Priority: Minor


Nutch SolrIndexer uses the Nutch score as the document boost by default. We should make 
this an option, because the Nutch score can be used for boosting in other ways, such 
as boosting at query time via a function query.




[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195
 ] 

Tien Nguyen Manh commented on NUTCH-1693:
-

This patch only works together with a minor change I made in NUTCH-1686: the 
signature is computed after the text has been set on "page".

> TextMD5Signatue compute on textual content
> --
>
> Key: NUTCH-1693
> URL: https://issues.apache.org/jira/browse/NUTCH-1693
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we use 
> boilerpipe to extract the main text from the content, so this signature is more 
> effective for deduplication.




[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Fix Version/s: 2.3

> TextMD5Signatue compute on textual content
> --
>
> Key: NUTCH-1693
> URL: https://issues.apache.org/jira/browse/NUTCH-1693
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we use 
> boilerpipe to extract the main text from the content, so this signature is more 
> effective for deduplication.




[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Attachment: NUTCH-1693.patch

> TextMD5Signatue compute on textual content
> --
>
> Key: NUTCH-1693
> URL: https://issues.apache.org/jira/browse/NUTCH-1693
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we use 
> boilerpipe to extract the main text from the content, so this signature is more 
> effective for deduplication.




[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Issue Type: New Feature  (was: Bug)

> TextMD5Signatue compute on textual content
> --
>
> Key: NUTCH-1693
> URL: https://issues.apache.org/jira/browse/NUTCH-1693
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we use 
> boilerpipe to extract the main text from the content, so this signature is more 
> effective for deduplication.




[jira] [Created] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

Tien Nguyen Manh created NUTCH-1693:
---

 Summary: TextMD5Signatue compute on textual content
 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: Bug
Reporter: Tien Nguyen Manh
Priority: Minor


I created a new MD5Signature that is based on textual content. In our case we use 
boilerpipe to extract the main text from the content, so this signature is more 
effective for deduplication.
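
The idea can be sketched as below: hash the extracted text rather than the raw content, so that pages whose markup differs but whose main text is identical get the same signature. This is a standalone illustration under assumptions, not the class from the attached patch; TextMd5Sketch and its method names are hypothetical, and the text extraction step (boilerpipe in our case) is assumed to have happened already.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: a standalone helper, not the class from the attached patch.
class TextMd5Sketch {
  /** Returns the MD5 digest of the extracted text, or of the empty string if text is null. */
  static byte[] signature(String text) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      String input = (text == null) ? "" : text;
      return md5.digest(input.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  /** Hex form, handy for logging and comparing signatures. */
  static String toHex(byte[] digest) {
    return String.format("%032x", new BigInteger(1, digest));
  }

  public static void main(String[] args) {
    // Two pages with different markup but the same extracted main text.
    String textA = "Same main text";
    String textB = "Same main text";
    System.out.println(toHex(signature(textA)).equals(toHex(signature(textB))));
  }
}
```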




[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2014-01-02 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
 ] 

Tien Nguyen Manh commented on NUTCH-1686:
-

In this patch I also fixed a bug with fetchTime. Currently, each time we run 
updatedb on the whole table, fetchTime is increased again for all URLs.

> Optimize UpdateDb to load less field from Store
> ---
>
> Key: NUTCH-1686
> URL: https://issues.apache.org/jira/browse/NUTCH-1686
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Tien Nguyen Manh
> Fix For: 2.3
>
> Attachments: NUTCH-1686.patch
>
>
> While running a large crawl I found that updatedb runs very slowly, especially the 
> Map task, which loads data from the store.
> We can't filter by batchId to load fewer URLs due to the bug in NUTCH-1679, so 
> we must always update the whole table.
> After checking the fields loaded in UpdateDbJob I found that it loads many 
> fields from the store (at least 15 of 25 fields), which makes updatedb slow.
> I think that UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, 
> METADATA, which are used to compute the link score and distance; I think that is the main 
> purpose of this job.
> The other fields are used to compute the URL schedule for the parser and fetcher. We 
> can move that code to the Parser or Fetcher without loading many new fields, because 
> many of the fields are generated by the parser. We can also use a Gora filter for the Fetcher 
> or Parser, so loading new fields is not a problem.
> I also add a new field SCOREMETA to WebPage to store CASH and DISTANCE, which are 
> currently stored in METADATA. CASH is used in OPICScoring, which is used 
> only in UpdateDb, and DISTANCE is used only in the Generator and Updater, so moving 
> both fields to a new metadata field avoids reading METADATA in the Generator 
> and Updater. METADATA contains a lot of data that is used only by the Parser and 
> Indexer.
> So with the new change:
> UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, MARKERS; we 
> don't need to load the big Fetch family and INLINKS.
> Generator only loads SCOREMETA (which is smaller than the current METADATA).




[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859364#comment-13859364
 ] 

Tien Nguyen Manh commented on NUTCH-1687:
-

It is nice!

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by scanning from the start of the queues list, so queues at 
> the start of the list have more chance to be picked first. That can cause a 
> long-tail problem, where at the end only a few queues remain available and they still hold many 
> URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search for a queue from the 
>                                   // start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time 
> needed to find an available queue and gives every queue a turn, and 
> if we use TopN during generate there is no long-tail queue at the end.
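
The round-robin idea can be sketched as follows. This is a simplified illustration under assumptions, not the attached patch and not the Fetcher classes: RoundRobinSketch and HostQueue are hypothetical names, and the sketch ignores the politeness checks (crawl delay, in-progress requests) that the real fetcher applies before handing out an item. After a queue hands out one URL it is moved to the tail, so the next call starts with a different queue instead of always scanning from the first one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: simplified stand-ins, not the Fetcher classes or the attached patch.
class RoundRobinSketch {
  /** A per-host queue of URLs (hypothetical simplification of FetchItemQueue). */
  static final class HostQueue {
    final String id;
    final Deque<String> urls = new ArrayDeque<>();
    HostQueue(String id, String... items) {
      this.id = id;
      for (String u : items) urls.add(u);
    }
  }

  private final Deque<HostQueue> queues = new ArrayDeque<>();

  synchronized void addQueue(HostQueue q) { queues.addLast(q); }

  /** Takes one URL from the queue at the head, then rotates that queue to the tail. */
  synchronized String getFetchItem() {
    while (!queues.isEmpty()) {
      HostQueue q = queues.pollFirst();
      String url = q.urls.pollFirst();
      if (url == null) {
        continue; // queue is exhausted: drop it and try the next one
      }
      if (!q.urls.isEmpty()) {
        queues.addLast(q); // give the other queues a turn first
      }
      return url;
    }
    return null; // nothing left to fetch
  }

  public static void main(String[] args) {
    RoundRobinSketch f = new RoundRobinSketch();
    f.addQueue(new HostQueue("a.example", "a1", "a2", "a3"));
    f.addQueue(new HostQueue("b.example", "b1"));
    f.addQueue(new HostQueue("c.example", "c1", "c2"));
    for (String u = f.getFetchItem(); u != null; u = f.getFetchItem()) {
      System.out.print(u + " ");
    }
  }
}
```

Because exhausted queues are dropped from the head of the deque, removal here is cheap; the patch handles deletion differently (by queue id), which this sketch does not try to reproduce.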




[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tien Nguyen Manh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859299#comment-13859299
 ] 

Tien Nguyen Manh commented on NUTCH-1687:
-

1. It seems redundant in this context.
2. I added an id so that the queues map can delete a FetchItemQueue by its id quickly; 
otherwise we must traverse from the start of queues.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by scanning from the start of the queues list, so queues at 
> the start of the list have more chance to be picked first. That can cause a 
> long-tail problem, where at the end only a few queues remain available and they still hold many 
> URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search for a queue from the 
>                                   // start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time 
> needed to find an available queue and gives every queue a turn, and 
> if we use TopN during generate there is no long-tail queue at the end.




[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:


Attachment: NUTCH-1687.patch

Added the Apache header.
Fixed the lost tail pointer when deleting.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by scanning from the start of the queues list, so queues at 
> the start of the list have more chance to be picked first. That can cause a 
> long-tail problem, where at the end only a few queues remain available and they still hold many 
> URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search for a queue from the 
>                                   // start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time 
> needed to find an available queue and gives every queue a turn, and 
> if we use TopN during generate there is no long-tail queue at the end.




[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:


Attachment: (was: NUTCH-1687.patch)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by scanning from the start of the queues list, so queues at 
> the start of the list have more chance to be picked first. That can cause a 
> long-tail problem, where at the end only a few queues remain available and they still hold many 
> URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always restarts the search for a queue from the 
>                                   // start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time 
> needed to find an available queue and gives every queue a turn, and 
> if we use TopN during generate there is no long-tail queue at the end.



