[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299
 ] 

lufeng commented on NUTCH-1854:
---

If we set fetcher.store.content=false and fetcher.parse=false, the 
bin/nutch parse command throws an exception when it checks whether the input 
content directory exists. I think the reason we need this parameter is that 
sometimes we set fetcher.parse to true and don't want to store the content 
because of a slow disk or limited disk space. So I think we can remove the 
fetcher.store.content parameter and simply not store the page content when 
fetcher.parse=true.

 ./bin/crawl fails with a parsing fetcher
 

 Key: NUTCH-1854
 URL: https://issues.apache.org/jira/browse/NUTCH-1854
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: NUTCH-1854ver1.patch


 If you run ./bin/crawl with a parsing fetcher e.g.
 <property>
   <name>fetcher.parse</name>
   <value>false</value>
   <description>If true, fetcher will parse content. Default is false,
   which means that a separate parsing step is required after fetching is
   finished.</description>
 </property>
 we get a horrible message as follows
 Exception in thread "main" java.io.IOException: Segment already parsed!
 We could improve this by making logging more complete and by adding a trigger 
 to the crawl script which would check for crawl_parse for a given segment and 
 then skip parsing if this is present.
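The check proposed above could look roughly like the following. This is only a minimal sketch: the class SegmentParseCheck and its method are hypothetical and not part of Nutch or of the crawl script, and it assumes the segment lives on the local file system rather than on HDFS.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: decides whether a segment still needs parsing.
class SegmentParseCheck {

    /** Returns true if the segment directory already contains a crawl_parse subdirectory. */
    static boolean alreadyParsed(Path segment) {
        return Files.isDirectory(segment.resolve("crawl_parse"));
    }

    public static void main(String[] args) {
        // The default path is only a placeholder for illustration.
        Path segment = new File(args.length > 0 ? args[0] : "crawl/segments/example").toPath();
        if (alreadyParsed(segment)) {
            System.out.println("Skipping parse, segment already parsed: " + segment);
        } else {
            System.out.println("Segment still needs parsing: " + segment);
        }
    }
}
```

With such a check, the script would call the parse step only when alreadyParsed returns false, instead of failing with "Segment already parsed!".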



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-10 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374
 ] 

lufeng commented on NUTCH-1939:
---

Hi Sebastian

One question: how do you use the FetchItem returned by the queueRedirect 
method? I can't find any code that uses the returned object. I think the 
queueRedirect method has already added the redirect URL back to the fetch queue.

 Fetcher fails to follow redirects
 -

 Key: NUTCH-1939
 URL: https://issues.apache.org/jira/browse/NUTCH-1939
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
 Fix For: 1.10

 Attachments: NUTCH-1939.patch


 As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with 
 http.redirect.max > 0 Fetcher does not follow redirects.





[jira] [Comment Edited] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-10 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374
 ] 

lufeng edited comment on NUTCH-1939 at 2/11/15 2:16 AM:


I think that's correct. +1


was (Author: amuseme.lu):
Hi Sebastian

One question: how do you use the FetchItem returned by the queueRedirect 
method? I can't find any code that uses the returned object. I think the 
queueRedirect method has already added the redirect URL back to the fetch queue.

 Fetcher fails to follow redirects
 -

 Key: NUTCH-1939
 URL: https://issues.apache.org/jira/browse/NUTCH-1939
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
 Fix For: 1.10

 Attachments: NUTCH-1939.patch


 As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with 
 http.redirect.max > 0 Fetcher does not follow redirects.





[jira] [Commented] (NUTCH-1829) Generator : unable to distinguish real errors

2014-08-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110193#comment-14110193
 ] 

lufeng commented on NUTCH-1829:
---

Yes, I think we should distinguish the different results by using different 
return codes, so that we can determine the next action from the return code. 

 Generator : unable to distinguish real errors
 -

 Key: NUTCH-1829
 URL: https://issues.apache.org/jira/browse/NUTCH-1829
 Project: Nutch
  Issue Type: Bug
  Components: nutchNewbie
Affects Versions: 1.9, 2.2.1
 Environment: Ubuntu Server 14.04, OpenJDK 7
Reporter: Mathieu Bouchard

 The bin/nutch generate command is returning the same error code (-1) if there 
 is an error or no new segment to process, so there is no way to tell if the 
 error is real or not from a shell script. This problem is related to 
 NUTCH-1828.
 The problem can be fixed by modifying the following Java source file:
 http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?revision=1619934&view=markup
 At line 711, if there is no new segment, the generator returns -1, which is 
 the same return code returned at line 714 if there was an error.
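One way to separate the two outcomes is sketched below. The constant names and the value 1 for "no new segment" are assumptions made for illustration; they are not taken from Generator.java or from any committed patch.

```java
// Hypothetical sketch: names and values are illustrative, not taken from Generator.java.
class GenerateExitCodes {
    static final int SUCCESS = 0;          // at least one segment was generated
    static final int NO_NEW_SEGMENT = 1;   // nothing to fetch, but no failure either
    static final int ERROR = -1;           // a real error occurred

    /** Maps the outcome of a generate run to a process exit code. */
    static int exitCode(boolean failed, int segmentsGenerated) {
        if (failed) {
            return ERROR;
        }
        if (segmentsGenerated == 0) {
            return NO_NEW_SEGMENT;
        }
        return SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println("no segment: " + exitCode(false, 0));
        System.out.println("error:      " + exitCode(true, 0));
        System.out.println("generated:  " + exitCode(false, 2));
    }
}
```

A shell script could then stop the crawl loop quietly on the "no new segment" code and report a failure only on the error code.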





[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045525#comment-14045525
 ] 

lufeng commented on NUTCH-385:
--

Hi Julien

In the description of fetcher.threads.per.queue, we could add that setting 
fetcher.threads.per.queue to a value > 1 will also cause fetcher.server.delay 
to be ignored. 

Another issue: I think the property name fetcher.max.crawl.delay is not 
consistent with fetcher.server.delay and fetcher.server.min.delay. Would 
renaming it to fetcher.server.max.delay be more suitable?


 Improve description of thread related configuration for Fetcher
 ---

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Fix For: 1.9

 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.
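The inconsistency described above can be reproduced with a small model. This is a simplified sketch of the described behaviour, not the HttpBase code: the class HostDelayModel and its methods are hypothetical, and time is passed in explicitly so the example stays deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the behaviour described above; it is not the HttpBase code.
class HostDelayModel {
    private final Map<String, Integer> threadsPerHost = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();
    private final int maxThreadsPerHost;
    private final long crawlDelayMs;

    HostDelayModel(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Tries to start a fetch; returns false if the host is delayed or saturated. */
    boolean tryBlock(String host, long now) {
        Long until = blockedUntil.get(host);
        if (until != null && now < until) {
            return false; // the crawl delay is still running
        }
        int active = threadsPerHost.getOrDefault(host, 0);
        if (active >= maxThreadsPerHost) {
            return false; // all allowed threads are busy on this host
        }
        threadsPerHost.put(host, active + 1);
        return true;
    }

    /** Ends a fetch; the delay is recorded only when the last thread leaves the host. */
    void unblock(String host, long now) {
        int active = threadsPerHost.getOrDefault(host, 0);
        if (active <= 1) {
            threadsPerHost.remove(host);
            blockedUntil.put(host, now + crawlDelayMs);
        } else {
            threadsPerHost.put(host, active - 1);
        }
    }
}
```

With two threads per host, the delay never applies as long as one thread is always active on the host, but it applies in full as soon as the count drops to zero. That is the inconsistency the report points out.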





[jira] [Commented] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010889#comment-14010889
 ] 

lufeng commented on NUTCH-1785:
---

+1, tested OK with Elasticsearch 1.2.0. 

One question: why convert the content byte[] to a String? If one segment 
contains HTML as well as PDF or MP3 content, how should this base64 parameter 
be set? 

 Ability to index raw content
 

 Key: NUTCH-1785
 URL: https://issues.apache.org/jira/browse/NUTCH-1785
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
 NUTCH-1785-trunk.patch


 Some use-cases require Nutch to actually write the raw content to a configured 
 indexing back-end. Since Content is never read, a plugin is out of the 
 question and therefore we need to force IndexJob to process Content as well.





[jira] [Closed] (NUTCH-1521) CrawlDbFilter pass null url to urlNormailzers

2014-04-16 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1521.
-

   Resolution: Fixed
Fix Version/s: (was: 2.4)
   1.9

 CrawlDbFilter pass null url to urlNormailzers
 -

 Key: NUTCH-1521
 URL: https://issues.apache.org/jira/browse/NUTCH-1521
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Trivial
 Fix For: 1.9

 Attachments: CrawlDbFilter_v1.patch, NUTCH-1521-trunk.patch, 
 TestCrawlDbFilter.java


 urlNormalizers will get a null url if we set CRAWLDB_PURGE_404, and it will 
 throw a NullPointerException. The WARN log will then output something like 
 "Skipping null NullPointerException".





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-04-15 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969601#comment-13969601
 ] 

lufeng commented on NUTCH-1726:
---

Hi all, is someone free to check this patch? Thanks. 

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.
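Traversing the nodes, as suggested above, could be sketched as follows. This is an illustration using the standard org.w3c.dom API with a hypothetical class NestedHeadingText; it is not the code from the attached patches, and it parses the heading as XML only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not the HeadingsParseFilter implementation.
class NestedHeadingText {

    /** Collects the text of a node and of all its descendants, in document order. */
    static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), sb); // recurse instead of reading only direct children
        }
    }

    /** Parses a small well-formed snippet and returns the text of its root element. */
    static String headingText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return textOf(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse snippet", e);
        }
    }
}
```

Because the recursion visits every descendant, text wrapped in a span inside the heading is found as well.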





[jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-09 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964219#comment-13964219
 ] 

lufeng commented on NUTCH-1752:
---

Do you mean that different ports with the same protocol and host can have 
different robots.txt files?

+1 


 cache robots.txt rules per protocol:host:port
 -

 Key: NUTCH-1752
 URL: https://issues.apache.org/jira/browse/NUTCH-1752
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.8, 2.2.1
Reporter: Sebastian Nagel
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1752-v1.patch


 HttpRobotRulesParser caches rules from {{robots.txt}} per protocol:host 
 (before NUTCH-1031 caching was per host only). The caching should be per 
 protocol:host:port. In doubt, a request to a different port may deliver a 
 different {{robots.txt}}. 
 Applying robots.txt rules to a combination of host, protocol, and port is 
 common practice:
 [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not 
 mention this explicitly (could be derived from examples) but others do:
 * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: each protocol and 
 port needs its own robots.txt file
 * [Google 
 webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
  The directives listed in the robots.txt file apply only to the host, 
 protocol and port number where the file is hosted.
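A cache key per protocol:host:port could be built as in the sketch below. The class RobotsCacheKey is hypothetical and not the HttpRobotRulesParser code. It also assumes that a missing port should be replaced by the protocol's default, so that http://example.com/ and http://example.com:80/ share one cache entry.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration; not the HttpRobotRulesParser code.
class RobotsCacheKey {

    /** Builds a protocol:host:port key, filling in the default port when none is given. */
    static String cacheKey(URI uri) {
        String protocol = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort(); // -1 when the URL carries no explicit port
        if (port == -1) {
            if ("http".equals(protocol)) {
                port = 80;
            } else if ("https".equals(protocol)) {
                port = 443;
            }
        }
        return protocol + ":" + host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(cacheKey(URI.create("http://example.com/robots.txt")));
        System.out.println(cacheKey(URI.create("http://example.com:8080/robots.txt")));
    }
}
```

The two URLs in main produce different keys, so rules fetched from port 8080 would not be applied to port 80.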





[jira] [Commented] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-18 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938867#comment-13938867
 ] 

lufeng commented on NUTCH-1733:
---

+1, all tests pass.

 parse-html to support HTML5 charset definitions
 ---

 Key: NUTCH-1733
 URL: https://issues.apache.org/jira/browse/NUTCH-1733
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8, 2.2.1
Reporter: Sebastian Nagel
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, 
 charset_html5.html


 HTML 5 allows specifying the character encoding of a page via
 * {{<meta charset="...">}}
 * Unicode Byte Order Mark (BOM)
 These are allowed in addition to previous HTTP/http-equiv Content-Type, see 
 [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]].
 Parse-html ignores both meta charset and BOM, falls back to the default 
 encoding (cp1252). Parse-tika sets the encoding appropriately.
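The two detection methods named above could be sketched as follows. The class CharsetSniffer is hypothetical and is neither the parse-html nor the parse-tika implementation. The regular expression is a simplification that only matches when charset is the first attribute of the meta tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the parse-html or parse-tika code.
class CharsetSniffer {

    // Simplification: matches <meta charset=...> only when charset is the first attribute.
    private static final Pattern META_CHARSET =
        Pattern.compile("<meta\\s+charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset indicated by a leading byte order mark, or null if there is none. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Returns the charset name declared by a meta charset tag, or null if none is found. */
    static String fromMeta(String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A parser could try the BOM first, then the meta tag, and fall back to the configured default encoding only when both return null.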





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937426#comment-13937426
 ] 

lufeng commented on NUTCH-1736:
---

Hi ysc

You can check the content size to fix this issue, like this:

{code:java}
if (http.getMaxContent() >= 0 && (contentBytesRead + chunkLen) > 
http.getMaxContent())
  chunkLen = http.getMaxContent() - contentBytesRead;
{code}
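The same check can be tried in isolation. This is a minimal sketch: the class ChunkLimit and its method are hypothetical and only restate the comparison from the snippet above, on the assumption (taken from that snippet) that a negative maximum content size means no limit.

```java
// Hypothetical stand-alone version of the check above; not the Nutch HttpResponse code.
class ChunkLimit {

    /**
     * Shortens a chunk so that the total number of bytes read never exceeds maxContent.
     * A negative maxContent is treated as "no limit".
     */
    static int clampChunkLength(int maxContent, int contentBytesRead, int chunkLen) {
        if (maxContent >= 0 && (contentBytesRead + chunkLen) > maxContent) {
            return maxContent - contentBytesRead;
        }
        return chunkLen;
    }

    public static void main(String[] args) {
        // 90 bytes already read, limit 100: only 10 more bytes of a 20-byte chunk are kept.
        System.out.println(clampChunkLength(100, 90, 20));
    }
}
```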

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch, nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355
 ] 

lufeng commented on NUTCH-1726:
---

Hi Markus

It seems that HeadingsFilter does not find nested nodes in my test code, but 
I cannot reproduce your test result when I use the following steps to test 
our patch:

{code:bash}
 svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
 cd nutch-svn2
 patch -p0 < NUTCH-1726-trunk.patch
 ant
 cd src/plugin/headings/
 ant test
{code}

Everything seems OK.

Yes, you are right, someone may want to ignore long headers. But should 
setting the headings.maxlength option to -1 disable this check? Someone may 
want to turn this feature off.

Feng





 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.





[jira] [Comment Edited] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355
 ] 

lufeng edited comment on NUTCH-1726 at 2/24/14 2:41 PM:


Hi Markus

It seems that HeadingsFilter does not find nested nodes in my test code, but 
I cannot reproduce your test result when I use the following steps to test 
our patch:

{code:java}
 svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
 cd nutch-svn2
 patch -p0 < NUTCH-1726-trunk.patch
 ant
 cd src/plugin/headings/
 ant test
{code}

Everything seems OK.

Yes, you are right, someone may want to ignore long headers. But should 
setting the headings.maxlength option to -1 disable this check? Someone may 
want to turn this feature off.

Feng






was (Author: amuseme.lu):
Hi Markus

It seems that HeadingsFilter does not find nested nodes in my test code, but 
I cannot reproduce your test result when I use the following steps to test 
our patch:

{code:bash}
 svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
 cd nutch-svn2
 patch -p0 < NUTCH-1726-trunk.patch
 ant
 cd src/plugin/headings/
 ant test
{code}

Everything seems OK.

Yes, you are right, someone may want to ignore long headers. But should 
setting the headings.maxlength option to -1 disable this check? Someone may 
want to turn this feature off.

Feng





 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432
 ] 

lufeng commented on NUTCH-1726:
---

Hi Markus. 

But I didn't find any error using your newest patch. 

{code:xml}
test:
[echo] Testing plugin: headings
[junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec

BUILD SUCCESSFUL
Total time: 3 seconds
{code}

* Maybe you can truncate long headers if their size is larger than the value 
of the maxlength option, so that the headings.truncate option can be removed.




 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.





[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1726:
--

Attachment: NUTCH-1726-trunk-v2.patch

Added a test case to check the HeadingsFilter patch. :)

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.





[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-03 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861502#comment-13861502
 ] 

lufeng commented on NUTCH-1691:
---

As in the urlfilter-prefix plugin, we can move the WARN code to keep the code 
consistent. :)

 DomainBlacklist url filter does not allow -D filter file override
 -

 Key: NUTCH-1691
 URL: https://issues.apache.org/jira/browse/NUTCH-1691
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8, 2.4

 Attachments: NUTCH-1691-trunk.patch


 This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
 plugin's file attribute is always used.





[jira] [Commented] (NUTCH-1667) Updatedb always ignore batchId

2013-11-22 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830525#comment-13830525
 ] 

lufeng commented on NUTCH-1667:
---

Yes, you are right. +1

 Updatedb always ignore batchId
 --

 Key: NUTCH-1667
 URL: https://issues.apache.org/jira/browse/NUTCH-1667
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Nguyen Manh Tien
Priority: Minor
 Attachments: NUTCH-1556-batchId.patch


 batchId is not set in currentJob because we set batchId after creating 
 currentJob, so in UpdateDbMapper batchId is null and will be assigned -all.
 I changed the code to set batchId before creating currentJob.





[jira] [Commented] (NUTCH-1671) indexchecker to add digest field

2013-11-22 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830530#comment-13830530
 ] 

lufeng commented on NUTCH-1671:
---

Yes, this field can be used by indexing filters. +1
Another question: should we add a check after parsing the content, like this?

{code:java}
ParseResult parseResult = new ParseUtil(conf).parse(content); 

if (parseResult == null) {
  LOG.error("Problem with parse - check log");
  return (-1);
}

{code}

 indexchecker to add digest field
 

 Key: NUTCH-1671
 URL: https://issues.apache.org/jira/browse/NUTCH-1671
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1671-2x.patch, NUTCH-1671-trunk.patch


 IndexingFiltersChecker does not add field digest as done by 
 IndexerMapReduce. Digest/signature could be also used by indexing filters 
 which then may fail.





[jira] [Created] (NUTCH-1670) set same crawldb directory in mergedb parameter

2013-11-20 Thread lufeng (JIRA)
lufeng created NUTCH-1670:
-

 Summary: set same crawldb directory in mergedb parameter
 Key: NUTCH-1670
 URL: https://issues.apache.org/jira/browse/NUTCH-1670
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8


When merging two crawldbs using the same crawldb directory twice in the 
bin/nutch mergedb parameters, a data-not-found exception is thrown. 

bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
bin/nutch generate crawldb_t1 segment
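A guard against this situation could be sketched as below. The class MergeDbArgsCheck is hypothetical and not the attached patch. It assumes that the first mergedb argument is the output crawldb and the remaining ones are inputs, and it compares local file system paths, whereas Nutch itself works with Hadoop paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical argument check for illustration; not the NUTCH-1670 patch.
class MergeDbArgsCheck {

    /** Returns true if the output directory is also listed as one of the input directories. */
    static boolean outputIsAlsoInput(String output, String... inputs) {
        Path out = Paths.get(output).toAbsolutePath().normalize();
        for (String input : inputs) {
            if (Paths.get(input).toAbsolutePath().normalize().equals(out)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors the failing call above: crawldb_t1 is both output and input.
        System.out.println(outputIsAlsoInput("crawldb_t1", "crawldb_t1", "crawldb_2"));
    }
}
```

The merge tool could refuse to run, or merge into a temporary directory first, when this check returns true.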







[jira] [Updated] (NUTCH-1670) set same crawldb directory in mergedb parameter

2013-11-20 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1670:
--

Attachment: NUTCH-1670.patch

 set same crawldb directory in mergedb parameter
 ---

 Key: NUTCH-1670
 URL: https://issues.apache.org/jira/browse/NUTCH-1670
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1670.patch


 When merging two crawldbs using the same crawldb directory twice in the 
 bin/nutch mergedb parameters, a data-not-found exception is thrown. 
 bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
 bin/nutch generate crawldb_t1 segment





[jira] [Work started] (NUTCH-1670) set same crawldb directory in mergedb parameter

2013-11-20 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1670 started by lufeng.

 set same crawldb directory in mergedb parameter
 ---

 Key: NUTCH-1670
 URL: https://issues.apache.org/jira/browse/NUTCH-1670
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1670.patch


 When merging two crawldbs using the same crawldb directory twice in the 
 bin/nutch mergedb parameters, a data-not-found exception is thrown. 
 bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
 bin/nutch generate crawldb_t1 segment





[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-11-04 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812840#comment-13812840
 ] 

lufeng commented on NUTCH-1651:
---

Hi Lewis
Yes, the patch is OK, and this is a way to set ModifiedTime. +1 

 modifiedTime and prevmodifiedTime never set 
 

 Key: NUTCH-1651
 URL: https://issues.apache.org/jira/browse/NUTCH-1651
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1651.patch


 modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
 always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime 
 is set only once in the beginning by zero-control of AdaptiveFetchScheduler.
 But this is not sufficient since modifiedTime needs to be updated whenever 
 last modified time is available. We corrected this with a patch.
 Also we noticed that prevModifiedTime is not written to database and we 
 corrected that too.
 With this patch, whenever lastModifiedTime is available, we do two things. 
 First we set modifiedTime in the Page object to prevModifiedTime. After that 
 we set lastModifiedTime to modifiedTime.
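The two steps described above can be sketched as follows. The Page class here is a hypothetical stand-in with only the two fields involved, not the Nutch WebPage. The header is parsed with java.time instead of Nutch's HttpDateFormat so that the example runs on its own.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical sketch of the update order; not the code from NUTCH-1651.patch.
class ModifiedTimeUpdate {

    /** Minimal stand-in for the page record, holding only the two timestamps. */
    static class Page {
        long modifiedTime;
        long prevModifiedTime;
    }

    /** Applies a Last-Modified header value to the page, if one is present and parseable. */
    static void apply(Page page, String lastModifiedHeader) {
        if (lastModifiedHeader == null) {
            return; // nothing to update
        }
        long lastModified;
        try {
            lastModified = ZonedDateTime
                .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant().toEpochMilli();
        } catch (DateTimeParseException e) {
            return; // keep the old values when the header cannot be parsed
        }
        page.prevModifiedTime = page.modifiedTime; // first remember the old value
        page.modifiedTime = lastModified;          // then store the new one
    }
}
```

The order matters: copying modifiedTime into prevModifiedTime has to happen before modifiedTime is overwritten.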





[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-10-30 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809081#comment-13809081
 ] 

lufeng commented on NUTCH-1651:
---

Hi Talat

Yes, you are right, lastModified is a fetch parameter, but it can also be set 
by parser plugins, because this attribute can also be defined by parsers. It 
is an attribute of WebPage. 

I can't find any code in Nutch 2.x that sets the ModifiedTime in WebPage, nor 
in Nutch 1.x. Very strange.



 modifiedTime and prevmodifiedTime never set 
 

 Key: NUTCH-1651
 URL: https://issues.apache.org/jira/browse/NUTCH-1651
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1651.patch


 modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
 always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime 
 is set only once in the beginning by zero-control of AdaptiveFetchScheduler.
 But this is not sufficient since modifiedTime needs to be updated whenever 
 last modified time is available. We corrected this with a patch.
 Also we noticed that prevModifiedTime is not written to database and we 
 corrected that too.
 With this patch, whenever lastModifiedTime is available, we do two things. 
 First we set modifiedTime in the Page object to prevModifiedTime. After that 
 we set lastModifiedTime to modifiedTime.





[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-10-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808045#comment-13808045
 ] 

lufeng commented on NUTCH-1651:
---

Hi Talat

But I don't think reading Last-Modified from the header belongs here. A user 
may want to detect modification of an HTML page in a parser plugin from the 
content at that URL rather than from the metadata in the headers, even when 
the Last-Modified value in the headers has changed.

{code:java}
+Utf8 lastModified = page.getFromHeaders(new Utf8("Last-Modified"));
+if (lastModified != null) {
+  try {
+    modifiedTime = HttpDateFormat.toLong(lastModified.toString());
+    prevModifiedTime = page.getModifiedTime();
+  } catch (Exception e) {
+  }
+}
{code}

Maybe a more appropriate way is to let a user-defined parser plugin set the 
value of the modified time, rather than setting it in the DbUpdateReducer class.

 modifiedTime and prevmodifiedTime never set 
 

 Key: NUTCH-1651
 URL: https://issues.apache.org/jira/browse/NUTCH-1651
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1651.patch


 modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
 always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime 
 is set only once in the beginning by zero-control of AdaptiveFetchScheduler.
 But this is not sufficient since modifiedTime needs to be updated whenever 
 last modified time is available. We corrected this with a patch.
 Also we noticed that prevModifiedTime is not written to database and we 
 corrected that too.
 With this patch, whenever lastModifiedTime is available, we do two things. 
 First we set modifiedTime in the Page object to prevModifiedTime. After that 
 we set lastModifiedTime to modifiedTime.





[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2013-10-28 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1645:
--

Attachment: NUTCH-1645-v3.patch

1. add an implementation of reaches a lower number of misses would cause the 
test to fail
2. improve the code style 

Yes, you are right, this unit test only checks the equality of some key 
statistics, as you said. But the problem is how to write a test case that verifies 
the correctness of algorithms in Nutch such as the AdaptiveFetchSchedule class 
and finds the bug that you pointed out in NUTCH-1564. Could you give me some 
suggestions? I will look at NUTCH-1564 and hope to find a solution to 
this issue.

Thanks Sebastian

 Junit Test Case for Adaptive Fetch Schedule class
 -

 Key: NUTCH-1645
 URL: https://issues.apache.org/jira/browse/NUTCH-1645
 Project: Nutch
  Issue Type: Test
Affects Versions: 2.2.1
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch, 
 NUTCH-1645-v3.patch


 Currently there is no test case for Adaptive Fetch Schedule. This issue adds a 
 JUnit test for it.





[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2013-10-06 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1645:
--

Attachment: NUTCH-1645-v2.patch

Added two test cases: one uses the default parameters and the other runs without 
sync delta enabled.

Thanks Yasin, you can add another test case with some parameters changed.

 Junit Test Case for Adaptive Fetch Schedule class
 -

 Key: NUTCH-1645
 URL: https://issues.apache.org/jira/browse/NUTCH-1645
 Project: Nutch
  Issue Type: Test
Affects Versions: 2.2.1
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch


 Currently there is no test case for Adaptive Fetch Schedule. This issue adds a 
 JUnit test for it.





[jira] [Commented] (NUTCH-1650) Adaptive Fetch Scheduler interval Wrong Set

2013-10-06 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13787664#comment-13787664
 ] 

lufeng commented on NUTCH-1650:
---

Yes, this code in Nutch 1.x is correct. +1

 Adaptive Fetch Scheduler interval Wrong Set
 ---

 Key: NUTCH-1650
 URL: https://issues.apache.org/jira/browse/NUTCH-1650
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
Priority: Minor
  Labels: scheduler
 Fix For: 2.3

 Attachments: NUTCH-1650.patch


 After the interval is calculated, it is set without checking that it lies between the 
 min and max values. Moreover, if sync_delta is true, the interval is set before the 
 changes are made. This patch fixes this.
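
The missing bounds check can be illustrated with a short sketch. It assumes a simple multiplicative adjustment (shrink the interval when the page was modified, grow it otherwise) followed by a clamp. The class `IntervalClamp`, its method names, and the rate parameters are hypothetical and are not copied from the patch or from AdaptiveFetchSchedule.

```java
// Hypothetical helper; names and the adjustment rule are assumptions of this sketch.
class IntervalClamp {

    /** Clamps a computed fetch interval (seconds) into [minInterval, maxInterval]. */
    static float clamp(float interval, float minInterval, float maxInterval) {
        if (minInterval > maxInterval) {
            throw new IllegalArgumentException("minInterval must not exceed maxInterval");
        }
        return Math.max(minInterval, Math.min(maxInterval, interval));
    }

    /**
     * Computes the next interval: shrink by decRate if the page was modified,
     * grow by incRate otherwise, and only then apply the min/max bounds.
     */
    static float nextInterval(float interval, boolean modified,
                              float incRate, float decRate,
                              float minInterval, float maxInterval) {
        float next = modified
                ? interval * (1.0f - decRate)
                : interval * (1.0f + incRate);
        return clamp(next, minInterval, maxInterval);
    }

    public static void main(String[] args) {
        // An unmodified page near the upper bound must not exceed maxInterval.
        System.out.println(nextInterval(3000f, false, 0.4f, 0.2f, 60f, 3600f));
    }
}
```

Clamping after the adjustment, rather than before it, is what keeps the stored interval inside the configured range.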





[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-12 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765410#comment-13765410
 ] 

lufeng commented on NUTCH-1556:
---

oh, I'm so sorry, I already fixed this problem.

commit revision 1522566 in 2.x HEAD.

thanks Julien.

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]
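
The usage line above, together with the row skipping mentioned in the description, can be sketched roughly as below. This is not the code from the patch: `UpdaterArgs`, `parse`, and `accepts` are hypothetical names, and the sketch only assumes that the first argument is either a batchId or `-all` and that `-crawlId` takes one value.

```java
// Hypothetical argument holder; NOT the actual DbUpdaterJob implementation.
class UpdaterArgs {
    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    final String batchId;  // null when -all was given
    final String crawlId;  // null when -crawlId was not given

    private UpdaterArgs(String batchId, String crawlId) {
        this.batchId = batchId;
        this.crawlId = crawlId;
    }

    /** Parses the command line; throws IllegalArgumentException on bad usage. */
    static UpdaterArgs parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException(USAGE);
        }
        String first = args[0];
        if (first.startsWith("-") && !"-all".equals(first)) {
            throw new IllegalArgumentException(USAGE);
        }
        String batchId = "-all".equals(first) ? null : first;
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i];
            } else {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new UpdaterArgs(batchId, crawlId);
    }

    /** With -all every row is accepted; otherwise only rows of the given batch. */
    boolean accepts(String rowBatchId) {
        return batchId == null || batchId.equals(rowBatchId);
    }

    public static void main(String[] args) {
        UpdaterArgs parsed = parse(new String[] {"-all", "-crawlId", "demo"});
        System.out.println(parsed.accepts("any-batch") + " " + parsed.crawlId);
    }
}
```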

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1636) Indexer to normalize and filter repr URL

2013-09-09 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761888#comment-13761888
 ] 

lufeng commented on NUTCH-1636:
---

yes, this patch can solve the issue reported by lain. +1

 Indexer to normalize and filter repr URL
 

 Key: NUTCH-1636
 URL: https://issues.apache.org/jira/browse/NUTCH-1636
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 1.7
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1636-1.patch


 Indexer if used with option -normalize and/or -filter (cf. NUTCH-1300) should 
 also normalize and filter representation URLs. Otherwise URLs which are 
 target of a redirect, and have repr URL set (see URLUtil.chooseRepr) may show 
 up in index with an undesirable URL.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-05 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759123#comment-13759123
 ] 

lufeng commented on NUTCH-1556:
---

Committed revision 1520332 in 2.x HEAD
Thanks kaveh. 

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]



[jira] [Resolved] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-05 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1556.
---

Resolution: Fixed

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-02 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756080#comment-13756080
 ] 

lufeng commented on NUTCH-1556:
---

I will commit this unless there are objections

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752432#comment-13752432
 ] 

lufeng commented on NUTCH-1556:
---

thanks kaveh. +1

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]



[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1556:
--

Attachment: NUTCH-1556-v2.patch

new patch merged with issue 1632

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]



[jira] [Created] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)
lufeng created NUTCH-1632:
-

 Summary: add batchId argument for DbUpdaterJob
 Key: NUTCH-1632
 URL: https://issues.apache.org/jira/browse/NUTCH-1632
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 2.2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3


Add a batchId argument to DbUpdaterJob so that a batchId can be passed to it.



[jira] [Updated] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1632:
--

Attachment: NUTCH-1632.patch

 add batchId argument for DbUpdaterJob
 -

 Key: NUTCH-1632
 URL: https://issues.apache.org/jira/browse/NUTCH-1632
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 2.2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1632.patch


 Add a batchId argument to DbUpdaterJob so that a batchId can be passed to it.



[jira] [Closed] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1632.
-

Resolution: Duplicate

 add batchId argument for DbUpdaterJob
 -

 Key: NUTCH-1632
 URL: https://issues.apache.org/jira/browse/NUTCH-1632
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 2.2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1632.patch


 Add a batchId argument to DbUpdaterJob so that a batchId can be passed to it.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13750803#comment-13750803
 ] 

lufeng commented on NUTCH-1556:
---

Hi Lewis, I'm sorry, I created a duplicate issue. I will merge these two patches 
into one. Can you check it out? Thanks.

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]



[jira] [Commented] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13750804#comment-13750804
 ] 

lufeng commented on NUTCH-1632:
---

Hi kaveh, I'm sorry. I will close this issue and merge the two patches into 
one. Thanks.

 add batchId argument for DbUpdaterJob
 -

 Key: NUTCH-1632
 URL: https://issues.apache.org/jira/browse/NUTCH-1632
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 2.2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1632.patch


 Add a batchId argument to DbUpdaterJob so that a batchId can be passed to it.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749663#comment-13749663
 ] 

lufeng commented on NUTCH-1619:
---

Hi Julien, I have already fixed the compilation bug, and I will pay more attention 
next time. Thanks for the reminder.

 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so that 
 this information can be used as a snippet by the indexer of the search engine we are 
 working on.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749409#comment-13749409
 ] 

lufeng commented on NUTCH-1619:
---

Committed @revision 1517147 in 2.x HEAD
Thank you very much Talat for the patch.


 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so that 
 this information can be used as a snippet by the indexer of the search engine we are 
 working on.



[jira] [Resolved] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-24 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1619.
---

Resolution: Fixed

 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so that 
 this information can be used as a snippet by the indexer of the search engine we are 
 working on.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749419#comment-13749419
 ] 

lufeng commented on NUTCH-1619:
---

I'm so sorry, DataStore may not throw IOException. It has already been fixed.
Committed @revision 1517155 in 2.x HEAD

 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so that 
 this information can be used as a snippet by the indexer of the search engine we are 
 working on.



[jira] [Commented] (NUTCH-1631) Display Document Count Added To Solr Server

2013-08-23 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748595#comment-13748595
 ] 

lufeng commented on NUTCH-1631:
---

Good statistical methods. +1 

 Display Document Count Added To Solr Server
 ---

 Key: NUTCH-1631
 URL: https://issues.apache.org/jira/browse/NUTCH-1631
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1, 2.2, 2.2.1
Reporter: Furkan KAMACI
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1631.patch


 Currently you can not see how many documents are added to Solr Server from 
 Nutch. One should be able to see how many documents are added to Solr Server 
 simultaneously (as a hadoop counter) and also total document count should be 
 logged too after all documents are added to Solr Server.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-22 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747558#comment-13747558
 ] 

lufeng commented on NUTCH-1619:
---

Thanks Talat. +1 for commit. 

 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so that 
 this information can be used as a snippet by the indexer of the search engine we are 
 working on.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-19 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13743621#comment-13743621
 ] 

lufeng commented on NUTCH-1619:
---

Hi Yasin, did you forget to close the data store? Otherwise, good.

 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so that 
 this information can be used as a snippet by the indexer of the search engine we are 
 working on.



[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-14 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739731#comment-13739731
 ] 

lufeng commented on NUTCH-1294:
---

Hi Lewis. My pleasure. But what should I do for README.txt? Do you 
mean I should also change something in 
https://svn.apache.org/repos/asf/nutch/branches/2.x/README.txt? :)

 IndexClean job with solr implementation.
 

 Key: NUTCH-1294
 URL: https://issues.apache.org/jira/browse/NUTCH-1294
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, 
 NUTCH-1294-v3.patch


 I started by copying/altering the trunk version of SolrClean, though it was 
 inadequate for our needs. We needed to mark particular pages as gone even 
 though they still might be visible on the web. This implementation abstracts 
 the index cleaning process, has a Solr implementation, and adds a clean index 
 plugin extension that allows others to tailor how pages might be removed from 
 their store.



[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with 2 threads and added cookie strings for both http protocols

2013-07-21 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714701#comment-13714701
 ] 

lufeng commented on NUTCH-1613:
---

OK. Will this cookie also affect other URLs that do not need any cookie, and could 
they receive a Bad Request error when using httpclient? It does not seem 
very general, so could we add a filter that specifies a different cookie for each 
host?
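
A per-host filter of the kind suggested here might look roughly like the sketch below. It only illustrates the idea and is not part of the patch: `HostCookies`, `put`, and `cookieFor` are hypothetical names, and the sketch assumes that cookies are looked up by the lower-cased host name of the URL.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical per-host cookie lookup; not an existing Nutch or httpclient class.
class HostCookies {
    private final Map<String, String> cookiesByHost = new HashMap<>();

    /** Registers the cookie string to send to the given host. */
    void put(String host, String cookie) {
        cookiesByHost.put(host.toLowerCase(Locale.ROOT), cookie);
    }

    /** Returns the cookie string for the URL's host, or null if none is configured. */
    String cookieFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        return cookiesByHost.get(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        HostCookies cookies = new HostCookies();
        cookies.put("example.com", "XX_AL=authorization_value_goes_here");
        System.out.println(cookies.cookieFor("http://example.com/page"));
        System.out.println(cookies.cookieFor("http://other.example.org/"));
    }
}
```

With such a lookup, URLs on hosts without a configured cookie would be fetched with no Cookie header at all.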

 Timeouts in protocol-httpclient when crawling same host with 2 threads and 
 added cookie strings for both http protocols
 

 Key: NUTCH-1613
 URL: https://issues.apache.org/jira/browse/NUTCH-1613
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 2.2.1
Reporter: Brian
Priority: Minor
  Labels: patch
 Fix For: 2.3

 Attachments: NUTCH-1613.patch


 1.)  When using protocol-httpclient to crawl a single website (the same host) 
 I would always get a bunch of timeout errors during fetching and the pages 
 with errors would not be fetched. E.g.:
 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www (queue crawl delay=0ms)
 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error: 
 org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
   at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
   at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
 This is because by default the connection pool manager only allows 2 
 connections per host so if more than 2 threads are used the others will tend 
 to time out waiting to get a connection.   The code previously set max 
 connections correctly but not connection per host.
 2.) I also added at the same time simple modifications to both protocol-http 
 and protocol-httpclient to allow specifying a cookie string in the conf file 
 to include in request headers.  
 I use this to crawl site content requiring authentication - it is better for 
 me to specify the cookie string for the authentication than go through the 
 whole authentication process and specifying login info.
 The nutch-site.xml property is the following:
 <property>
   <name>http.cookie_string</name>
   <value>XX_AL=authorization_value_goes_here</value>
   <description>String to use as the cookie value for HTTP requests</description>
 </property>
 Although I use it for authentication it can be used to specify any single 
 cookie string for the crawl (httpclient does support different cookies for 
 different hosts but I did not get into that).
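
The per-host connection limit described in point 1 can be illustrated without the httpclient library. The following is a minimal sketch, assuming a plain `java.util.concurrent.Semaphore` as a stand-in for the connection pool. `PerHostPool` and its methods are hypothetical names and are not part of commons-httpclient or Nutch.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-host connection pool (assumption of this sketch).
class PerHostPool {
    private final Semaphore permits;

    PerHostPool(int maxConnectionsPerHost) {
        this.permits = new Semaphore(maxConnectionsPerHost);
    }

    /** Tries to obtain a connection slot; returns false if the wait times out. */
    boolean acquire(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Returns a connection slot to the pool. */
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        PerHostPool pool = new PerHostPool(2); // two slots per host, as in the report
        System.out.println(pool.acquire(10));  // first fetcher thread gets a slot
        System.out.println(pool.acquire(10));  // second fetcher thread gets a slot
        System.out.println(pool.acquire(10));  // third one times out while both are busy
    }
}
```

With two slots per host and more than two fetcher threads on the same host, the extra threads wait and eventually time out, which matches the behaviour reported above.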



[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with 2 threads and added cookie strings for both http protocols

2013-07-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711150#comment-13711150
 ] 

lufeng commented on NUTCH-1613:
---

Will this cookie string affect all crawled URLs?

 Timeouts in protocol-httpclient when crawling same host with 2 threads and 
 added cookie strings for both http protocols
 

 Key: NUTCH-1613
 URL: https://issues.apache.org/jira/browse/NUTCH-1613
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 2.2.1
Reporter: Brian
Priority: Minor
  Labels: patch
 Attachments: NUTCH-1613.patch


 1.)  When using protocol-httpclient to crawl a single website (the same host) 
 I would always get a bunch of timeout errors during fetching and the pages 
 with errors would not be fetched. E.g.:
 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www (queue crawl delay=0ms)
 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error: 
 org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
   at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
   at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
 This is because by default the connection pool manager only allows 2 
 connections per host so if more than 2 threads are used the others will tend 
 to time out waiting to get a connection.   The code previously set max 
 connections correctly but not connection per host.
 2.) I also added at the same time simple modifications to both protocol-http 
 and protocol-httpclient to allow specifying a cookie string in the conf file 
 to include in request headers.  
 I use this to crawl site content that requires authentication: it is easier for 
 me to specify the authentication cookie string than to go through the 
 whole authentication process and specify login info.
 The nutch-site.xml property is the following:
 <property>
   <name>http.cookie_string</name>
   <value>XX_AL=authorization_value_goes_here</value>
   <description>String to use as the cookie value for HTTP 
   requests</description>
 </property>
 Although I use it for authentication it can be used to specify any single 
 cookie string for the crawl (httpclient does support different cookies for 
 different hosts but I did not get into that).
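
The connection timeout described above can be reproduced with a small, self-contained sketch that uses only the JDK. It is not the attached patch: the class name PerHostPool and its methods are hypothetical stand-ins for a connection manager with a per-host cap. In Commons HttpClient 3.x the corresponding limits are set through HttpConnectionManagerParams (setDefaultMaxConnectionsPerHost and setMaxTotalConnections); how the patch wires those to the fetcher thread count is not shown here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a connection manager with a per-host cap.
class PerHostPool {
  private final int maxPerHost;
  private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

  PerHostPool(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  // Returns true if a connection slot for the host was obtained within timeoutMs.
  boolean acquire(String host, long timeoutMs) {
    Semaphore s = permits.computeIfAbsent(host, h -> new Semaphore(maxPerHost));
    try {
      return s.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  // Hands the slot back so that a waiting fetcher thread can proceed.
  void release(String host) {
    Semaphore s = permits.get(host);
    if (s != null) {
      s.release();
    }
  }

  public static void main(String[] args) {
    PerHostPool pool = new PerHostPool(2); // cap of 2 per host, as in the description
    System.out.println(pool.acquire("www.example.com", 10)); // first slot
    System.out.println(pool.acquire("www.example.com", 10)); // second slot
    System.out.println(pool.acquire("www.example.com", 10)); // third request times out
  }
}
```

With a cap of 2, a third fetcher thread for the same host can only succeed once one of the first two releases its slot, which is why raising the per-host limit to match the thread count avoids the timeouts.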

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-04 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13700082#comment-13700082
 ] 

lufeng commented on NUTCH-1602:
---

Hi Markus, this change only affects the *normal* output format, not the 
CSV output format; these are two separate CrawlDatum output formats. The 
normal output now looks like this, which is more readable than the previous one.

{code:xml}
http://www.baidu.com/   Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: 
m1=v22
m3=v3
m2=v2
m5=v5
m4=m4
_pst_=robots_denied(18), lastModified=0
m6=v6

{code}

Thanks Julien and Tejas.
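
The one-entry-per-line layout shown above can be sketched as follows. This is only an illustration over a plain java.util.Map; the names MetadataDump and format are hypothetical, and the real change in NUTCH-1602.patch operates on the CrawlDatum metadata, which is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MetadataDump {
  // Renders each entry as "key=value" on its own line under a "Metadata:" header.
  static String format(Map<String, String> metadata) {
    StringBuilder sb = new StringBuilder("Metadata:\n");
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> md = new LinkedHashMap<>();
    md.put("m1", "v22");
    md.put("m3", "v3");
    md.put("_pst_", "robots_denied(18), lastModified=0");
    System.out.print(format(md));
  }
}
```

Putting each pair on its own line keeps values that themselves contain commas or spaces (such as the _pst_ entry) unambiguous.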

 improve the readability of metadata in readdb dump normal 
 --

 Key: NUTCH-1602
 URL: https://issues.apache.org/jira/browse/NUTCH-1602
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1602.patch


 The dumped metadata format is not readable.
 {code:xml}
 $bin/nutch readdb crawldb/ -dump dir
 http://www.baidu.com/ Version: 7
 Status: 3 (db_gone)
 Fetch time: Sat Aug 17 22:35:37 CST 2013
 Modified time: Thu Jan 01 08:00:00 CST 1970
 Retries since fetch: 0
 Retry interval: 3888000 seconds (45 days)
 Score: 1.0
 Signature: null
 Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
 lastModified=0m6: v6
 {code}
 So I improved the Metadata format to this:
 {code:xml}
 Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
 lastModified=0;m6=v6;
 {code}



[jira] [Resolved] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-04 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1602.
---

Resolution: Fixed

 improve the readability of metadata in readdb dump normal 
 --

 Key: NUTCH-1602
 URL: https://issues.apache.org/jira/browse/NUTCH-1602
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch


 The dumped metadata format is not readable.
 {code:xml}
 $bin/nutch readdb crawldb/ -dump dir
 http://www.baidu.com/ Version: 7
 Status: 3 (db_gone)
 Fetch time: Sat Aug 17 22:35:37 CST 2013
 Modified time: Thu Jan 01 08:00:00 CST 1970
 Retries since fetch: 0
 Retry interval: 3888000 seconds (45 days)
 Score: 1.0
 Signature: null
 Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
 lastModified=0m6: v6
 {code}
 So I improved the Metadata format to this:
 {code:xml}
 Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
 lastModified=0;m6=v6;
 {code}



[jira] [Commented] (NUTCH-1600) Injector overwrite does not always work properly

2013-07-03 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699034#comment-13699034
 ] 

lufeng commented on NUTCH-1600:
---

Tests work fine. 
+1

 Injector overwrite does not always work properly
 

 Key: NUTCH-1600
 URL: https://issues.apache.org/jira/browse/NUTCH-1600
 Project: Nutch
  Issue Type: Bug
  Components: injector
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8

 Attachments: NUTCH-1600-1.8.patch


 db.injector.update works as it should, but db.injector.overwrite doesn't 
 always seem to properly overwrite the record. This issue has existed for some time 
 and we've already fixed it in our dist of Nutch.
 This record has just been updated (interval).
 {code}
 Injector: starting at 2013-07-03 10:34:15
 Injector: crawlDb: crawl/crawldb
 Injector: urlDir: seeds
 Injector: Converting injected urls to crawl db entries.
 Injector: total number of urls rejected by filters: 0
 Injector: total number of urls injected after normalization and filtering: 9
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2013-07-03 10:34:21, elapsed: 00:00:05
 URL: url
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Fri Jul 05 12:11:44 CEST 2013
 Modified time: Fri Jun 28 12:11:44 CEST 2013
 Retries since fetch: 0
 Retry interval: 604800 seconds (7 days)
 Score: 0.0
 Signature: ba29ef3e680323a6d0da74c156800e03
 Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
 {code}
 If we now overwrite the record, nothing happens. With this patch installed it 
 overwrites the record as it should and also logs the update & overwrite switches 
 to the console:
 {code}
 Injector: starting at 2013-07-03 10:36:30
 Injector: crawlDb: crawl/crawldb
 Injector: urlDir: seeds
 Injector: Converting injected urls to crawl db entries.
 Injector: total number of urls rejected by filters: 0
 Injector: total number of urls injected after normalization and filtering: 9
 Injector: Merging injected urls into crawl db.
 Injector: overwrite: true
 Injector: update: false
 Injector: finished at 2013-07-03 10:36:36, elapsed: 00:00:05
 URL: url
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Wed Jul 03 10:36:30 CEST 2013
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 14000 seconds (0 days)
 Score: 1.0
 Signature: null
 Metadata: fixedInterval: 14000.0
 {code}



[jira] [Updated] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-03 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1602:
--

Attachment: NUTCH-1602.patch

 improve the readability of metadata in readdb dump normal 
 --

 Key: NUTCH-1602
 URL: https://issues.apache.org/jira/browse/NUTCH-1602
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1602.patch


 The dumped metadata format is not readable.
 {code:xml}
 $bin/nutch readdb crawldb/ -dump dir
 http://www.baidu.com/ Version: 7
 Status: 3 (db_gone)
 Fetch time: Sat Aug 17 22:35:37 CST 2013
 Modified time: Thu Jan 01 08:00:00 CST 1970
 Retries since fetch: 0
 Retry interval: 3888000 seconds (45 days)
 Score: 1.0
 Signature: null
 Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
 lastModified=0m6: v6
 {code}
 So I improved the Metadata format to this:
 {code:xml}
 Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
 lastModified=0;m6=v6;
 {code}



[jira] [Created] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-03 Thread lufeng (JIRA)
lufeng created NUTCH-1602:
-

 Summary: improve the readability of metadata in readdb dump normal 
 Key: NUTCH-1602
 URL: https://issues.apache.org/jira/browse/NUTCH-1602
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8


The dumped metadata format is not readable.

{code:xml}
$bin/nutch readdb crawldb/ -dump dir
http://www.baidu.com/   Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
lastModified=0m6: v6
{code}

So I improved the Metadata format to this:

{code:xml}
Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
lastModified=0;m6=v6;
{code}



[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696798#comment-13696798
 ] 

lufeng commented on NUTCH-1594:
---

Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis.

 count variable is never changed in ParseUtil class
 --

 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1594.patch


 In the ParseUtil class the count variable is never changed. The code looks like this: 
 for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
 so even if you define the db.max.outlinks.per.page parameter, it will not 
 take effect.
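
A minimal sketch of the problem and of one possible fix, assuming the loop is meant to keep at most maxOutlinks entries. The class OutlinkLimit and the method limit are hypothetical and work on plain strings; the actual change is in NUTCH-1594.patch and may differ, for example in which outlinks it counts.

```java
import java.util.ArrayList;
import java.util.List;

class OutlinkLimit {
  // Keeps at most maxOutlinks entries. count is incremented for every outlink kept,
  // which is the step the loop in the description never performs.
  static List<String> limit(String[] outlinks, int maxOutlinks) {
    List<String> kept = new ArrayList<>();
    int count = 0;
    for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
      if (outlinks[i] == null || outlinks[i].isEmpty()) {
        continue; // skipped entries do not count toward the limit
      }
      kept.add(outlinks[i]);
      count++;
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] links = {"http://a/", "", "http://b/", "http://c/"};
    System.out.println(limit(links, 2));
  }
}
```

Without the count++ line the first loop condition is always true, so only outlinks.length bounds the loop and the configured maximum is ignored.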



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696854#comment-13696854
 ] 

lufeng commented on NUTCH-1327:
---

Hi Markus, I tested your patch. Did you forget to add the deploy and test targets to 
src/plugin/build.xml?

+1 

 QueryStringNormalizer
 -

 Key: NUTCH-1327
 URL: https://issues.apache.org/jira/browse/NUTCH-1327
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1327-1.8-1.patch


 A normalizer for dealing with query strings. Sorting query strings is helpful 
 in preventing duplicates for some (bad) websites.
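
To illustrate why sorting helps: two URLs that differ only in the order of their query parameters collapse to the same string once the parameters are sorted. The sketch below is not the plugin from the attached patch; the class QuerySort and the method normalize are hypothetical, and it assumes that parameters are separated by '&' and that any '#' fragment follows the query.

```java
import java.util.Arrays;

class QuerySort {
  // Sorts the '&'-separated parameters of the query string and leaves the rest of the URL untouched.
  static String normalize(String url) {
    int q = url.indexOf('?');
    if (q < 0 || q == url.length() - 1) {
      return url; // no query string, nothing to sort
    }
    String base = url.substring(0, q);
    String query = url.substring(q + 1);
    String fragment = "";
    int hash = query.indexOf('#');
    if (hash >= 0) {
      fragment = query.substring(hash);
      query = query.substring(0, hash);
    }
    String[] params = query.split("&");
    Arrays.sort(params);
    return base + "?" + String.join("&", params) + fragment;
  }

  public static void main(String[] args) {
    System.out.println(normalize("http://example.com/p?b=2&a=1"));
    System.out.println(normalize("http://example.com/p?a=1&b=2"));
  }
}
```

Both calls in main produce the same URL, so a crawl database keyed on the normalized form would store the page only once.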



[jira] [Created] (NUTCH-1594) count variable is never in ParseUtil

2013-06-29 Thread lufeng (JIRA)
lufeng created NUTCH-1594:
-

 Summary: count variable is never in ParseUtil 
 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Priority: Minor
 Fix For: 2.3


In the ParseUtil class the count variable is never changed. The code looks like this: 
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 



[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1594:
--

Description: 
In the ParseUtil class the count variable is never changed. The code looks like this: 
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 

so even if you define the db.max.outlinks.per.page parameter, it will not 
take effect.

  was:
in ParseUtil class the count variable is never change. the code is like this 
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 

Summary: count variable is never changed in ParseUtil class  (was: 
count variable is never in ParseUtil )

 count variable is never changed in ParseUtil class
 --

 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Priority: Minor
 Fix For: 2.3


 In the ParseUtil class the count variable is never changed. The code looks like this: 
 for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
 so even if you define the db.max.outlinks.per.page parameter, it will not 
 take effect.



[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1594:
--

Patch Info: Patch Available

 count variable is never changed in ParseUtil class
 --

 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1594.patch


 In the ParseUtil class the count variable is never changed. The code looks like this: 
 for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
 so even if you define the db.max.outlinks.per.page parameter, it will not 
 take effect.



[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1594:
--

Attachment: NUTCH-1594.patch

 count variable is never changed in ParseUtil class
 --

 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1594.patch


 In the ParseUtil class the count variable is never changed. The code looks like this: 
 for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
 so even if you define the db.max.outlinks.per.page parameter, it will not 
 take effect.



[jira] [Assigned] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1594:
-

Assignee: lufeng

 count variable is never changed in ParseUtil class
 --

 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1594.patch


 In the ParseUtil class the count variable is never changed. The code looks like this: 
 for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
 so even if you define the db.max.outlinks.per.page parameter, it will not 
 take effect.



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-18 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686830#comment-13686830
 ] 

lufeng commented on NUTCH-1527:
---

Thanks Markus, I tried the patch and can index documents successfully. +1 for 
commit.

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, 
 NUTCH-1527.patch, NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685661#comment-13685661
 ] 

lufeng commented on NUTCH-1527:
---

Hi Markus, I have already tested the newest patch on my laptop. Very cool. +1 
for commit.

{code:xml}
lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ 
bin/nutch index crawldb/ segmetns/20130617225826/
Indexer: starting at 2013-06-17 23:46:47
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.index : elastic index command 
elastic.max.bulk.docs : elastic bulk index doc counts. (default 500) 
elastic.max.bulk.size : elastic bulk index length. (default 5001001 
~5MB)


Processing remaining requests [docs = 1, length = 7528, total docs = 1]
Processing to finalize last execute
Previous took in ms 27, including wait 21
Indexer: finished at 2013-06-17 23:46:57, elapsed: 00:00:10
{code}

One question, though: should we add the elastic.cluster and elastic.index 
properties to the nutch-default.xml file?

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, 
 NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682380#comment-13682380
 ] 

lufeng commented on NUTCH-1527:
---

Hi Markus

1. Elasticsearch loads its configuration file first, so you need to add 
config/elasticsearch.yml under your runtime/local/config. However, I have not found any 
way to load that configuration file through the Nutch configuration.

2. Do you still have lucene-core-3.4.jar in your runtime/local/lib directory? 
Or did you add this

{code:xml}
+    <dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.1"
+      conf="*->default"/>
{code}

to the ivy/ivy.xml file? 

Maybe Elasticsearch cannot load classes inside the Nutch plugin system.


 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Closed] (NUTCH-1575) support solr authentication in nutch 2.x

2013-06-03 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1575.
-


 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1575.patch


 Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1545:
--

Fix Version/s: (was: 2.3)
   2.2

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13670376#comment-13670376
 ] 

lufeng commented on NUTCH-1545:
---

Committed for Nutch 2.2 in revision 1487875 by Feng. Thanks Tejas and Lewis.

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Resolved] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1545.
---

Resolution: Fixed

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1563.
---

Resolution: Fixed

 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to read some fields of WebPage, they always return 
 null.
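
One way to picture the bug: the generator decides which WebPage fields to load, and unless it merges in the fields requested by the configured schedule, those fields are presumably never read and so appear as null. The sketch below uses hypothetical stand-ins (Schedule, GeneratorFields, plain string field names) rather than the real Nutch 2.x classes; see NUTCH-1563.patch for the actual fix.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a schedule that declares which page fields it needs.
interface Schedule {
  Collection<String> getFields();
}

class GeneratorFields {
  // Fields the job itself needs (illustrative names only).
  private static final Set<String> OWN =
      new LinkedHashSet<>(Arrays.asList("fetchTime", "score", "status"));

  // Merges the job's own fields with those requested by the schedule.
  static Set<String> fieldsToLoad(Schedule schedule) {
    Set<String> fields = new LinkedHashSet<>(OWN);
    fields.addAll(schedule.getFields());
    return fields;
  }

  public static void main(String[] args) {
    Schedule custom = () -> Arrays.asList("modifiedTime", "score");
    System.out.println(fieldsToLoad(custom));
  }
}
```

Using a set means a field requested by both the job and the schedule is loaded only once.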



[jira] [Closed] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1563.
-


 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to read some fields of WebPage, they always return 
 null.



[jira] [Resolved] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1575.
---

Resolution: Fixed

 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1575.patch


 Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Commented] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669351#comment-13669351
 ] 

lufeng commented on NUTCH-1575:
---

Committed for 2.2 revision 1487521 by Feng. Thanks Lewis

 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1575.patch


 Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667766#comment-13667766
 ] 

lufeng commented on NUTCH-1527:
---

Hi Luca, sorry for my delayed reply. Yes, you can improve this patch following
your suggestion. Can I assign this issue to you? I am willing to test it.
Thanks, Luca.




-- 
Don't Grow Old, Grow Up... :-)


 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1527:
--

Assignee: (was: lufeng)

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667775#comment-13667775
 ] 

lufeng commented on NUTCH-1527:
---

Hi Luca, you can now click "Assign to me" and then attach your improvement patch.
Thanks, Luca.

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Updated] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-23 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1563:
--

Fix Version/s: (was: 2.3)
   2.2

 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to read some fields of WebPage, it always returns 
 null.



[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-23 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13665161#comment-13665161
 ] 

lufeng commented on NUTCH-1563:
---

Hi Tejas,

Yes, I pushed this patch to 2.x. 

https://svn.apache.org/repos/asf/nutch/branches/2.x

Do you mean that I pushed it to the wrong place?

 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to read some fields of WebPage, it always returns 
 null.



[jira] [Created] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)
lufeng created NUTCH-1575:
-

 Summary: support solr authentication in nutch 2.x
 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2


Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Work started] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1575 started by lufeng.

 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2


 Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Updated] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1575:
--

Attachment: NUTCH-1575.patch

add solr authentication

 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1575.patch


 Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662003#comment-13662003
 ] 

lufeng commented on NUTCH-1563:
---

Committed for 2.2 revision 1484482 by Feng. Thanks Canan and Lewis.

 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to read some fields of WebPage, it always returns 
 null.



[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662057#comment-13662057
 ] 

lufeng commented on NUTCH-1545:
---

Hi Tejas,

Yes, the patch just moves the random batchId generator from the code into the 
crawl script. Users don't have to bother with this.

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Commented] (NUTCH-1486) Upgrade to Solr 4.2.1

2013-05-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13651936#comment-13651936
 ] 

lufeng commented on NUTCH-1486:
---

Hi Lewis,
The dependency version of solr-solrj in pom.xml is still 3.1.0. Should we 
upgrade it to 4.2.1?

 Upgrade to Solr 4.2.1
 -

 Key: NUTCH-1486
 URL: https://issues.apache.org/jira/browse/NUTCH-1486
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6, 2.1
 Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT  Probably 2.2-SNAPHOT
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, 
 NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch


 When attempting to configure a 4 multicore 4.0 instance with Nutch 
 schema-solr4.xml file, I get the following exceptions.
 This has been discussed previously. As I see it we have two options
 1. Keep maintaining both schema options
 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml
 Thoughts?
 {code}
 SEVERE: Unable to create core: collection4
 org.apache.solr.common.SolrException: Unable to use updateLog: _version_field 
 must exist in schema, using indexed=true stored=true and 
 multiValued=false (_version_ does not exist)
   at org.apache.solr.core.SolrCore.init(SolrCore.java:721)
   at org.apache.solr.core.SolrCore.init(SolrCore.java:566)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
   at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
   at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
   at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
   at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
   at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
   at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
   at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
   at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
   at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
   at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
   at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
   at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
   at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
   at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
   at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
   at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
   at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
   at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
   at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
   at org.eclipse.jetty.server.Server.doStart(Server.java:263)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
   at java.security.AccessController.doPrivileged(Native Method)
   at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 

[jira] [Assigned] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-08 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1527:
-

Assignee: lufeng

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.3, 1.8


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-08 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1527:
--

Attachment: NUTCH-1527.patch

This ports the elasticsearch indexer plugin to Nutch trunk. Before you apply this 
patch, you need to apply https://issues.apache.org/jira/browse/NUTCH-1486 first. 
I use the newest version of elasticsearch, 0.90.0, which uses Lucene 4.2.1. 
This patch needs more testing, as I am a newbie to elasticsearch. Any 
comments about this patch are welcome.

thanks Lewis.

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Updated] (NUTCH-1555) Move to commons-cli for command line parsing

2013-04-25 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1555:
--

Attachment: NUTCH-1555-v1.patch

Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.

Should we move every tool to commons-cli? I found 47 files that 
need to be moved.

[~wastl-nagel]
1. use eclipse-codeformat.xml to format the source code

Thanks Lewis and Sebastian.

 Move to commons-cli for command line parsing 
 -

 Key: NUTCH-1555
 URL: https://issues.apache.org/jira/browse/NUTCH-1555
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
 Fix For: 2.2

 Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch


 I just accidentally passed in the following argument to parser job
 {code}
 law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
 updatedb
 ParserJob: starting
 ParserJob: resuming:  false
 ParserJob: forced reparse:false
 ParserJob: batchId:   updatedb
 ParserJob: success
 {code}
 This is a bug for sure



[jira] [Comment Edited] (NUTCH-1555) Move to commons-cli for command line parsing

2013-04-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641869#comment-13641869
 ] 

lufeng edited comment on NUTCH-1555 at 4/25/13 2:48 PM:


Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.

Should we move every tool to commons-cli? I found 47 files that 
need to be moved.

Sebastian:
1. use eclipse-codeformat.xml to format the source code

Thanks Lewis and Sebastian.

  was (Author: amuseme.lu):
Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.

Should we move every tool to commons-cli? I found 47 files that 
need to be moved.

[~wastl-nagel]
1. use eclipse-codeformat.xml to format the source code

Thanks Lewis and Sebastian.
  
 Move to commons-cli for command line parsing 
 -

 Key: NUTCH-1555
 URL: https://issues.apache.org/jira/browse/NUTCH-1555
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
 Fix For: 2.2

 Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch


 I just accidentally passed in the following argument to parser job
 {code}
 law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
 updatedb
 ParserJob: starting
 ParserJob: resuming:  false
 ParserJob: forced reparse:false
 ParserJob: batchId:   updatedb
 ParserJob: success
 {code}
 This is a bug for sure



[jira] [Comment Edited] (NUTCH-1555) Move to commons-cli for command line parsing

2013-04-23 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13639131#comment-13639131
 ] 

lufeng edited comment on NUTCH-1555 at 4/23/13 2:58 PM:


I have already moved the command line parsing of the following files to 
commons-cli, because they are used by the bin/nutch command line. 

{code:java}
src/java/org/apache/nutch/api/NutchServer.java
src/java/org/apache/nutch/crawl/DbUpdaterJob.java
src/java/org/apache/nutch/crawl/GeneratorJob.java
src/java/org/apache/nutch/crawl/InjectorJob.java
src/java/org/apache/nutch/crawl/WebTableReader.java
src/java/org/apache/nutch/fetcher/FetcherJob.java
src/java/org/apache/nutch/host/HostDbReader.java
src/java/org/apache/nutch/host/HostDbUpdateJob.java
src/java/org/apache/nutch/host/HostInjectorJob.java
src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
src/java/org/apache/nutch/indexer/elastic/ElasticIndexerJob.java
src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java
src/java/org/apache/nutch/parse/ParserChecker.java
src/java/org/apache/nutch/parse/ParserJob.java
src/java/org/apache/nutch/plugin/PluginRepository.java
{code}

  was (Author: amuseme.lu):
I have already moved the command line parsing to commons-cli, because they are 
used by the bin/nutch command line. 

{code:java}
src/java/org/apache/nutch/api/NutchServer.java
src/java/org/apache/nutch/crawl/DbUpdaterJob.java
src/java/org/apache/nutch/crawl/GeneratorJob.java
src/java/org/apache/nutch/crawl/InjectorJob.java
src/java/org/apache/nutch/crawl/WebTableReader.java
src/java/org/apache/nutch/fetcher/FetcherJob.java
src/java/org/apache/nutch/host/HostDbReader.java
src/java/org/apache/nutch/host/HostDbUpdateJob.java
src/java/org/apache/nutch/host/HostInjectorJob.java
src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
src/java/org/apache/nutch/indexer/elastic/ElasticIndexerJob.java
src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java
src/java/org/apache/nutch/parse/ParserChecker.java
src/java/org/apache/nutch/parse/ParserJob.java
src/java/org/apache/nutch/plugin/PluginRepository.java
{code}
  
 Move to commons-cli for command line parsing 
 -

 Key: NUTCH-1555
 URL: https://issues.apache.org/jira/browse/NUTCH-1555
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
 Fix For: 2.2

 Attachments: NUTCH-1555.patch


 I just accidentally passed in the following argument to parser job
 {code}
 law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
 updatedb
 ParserJob: starting
 ParserJob: resuming:  false
 ParserJob: forced reparse:false
 ParserJob: batchId:   updatedb
 ParserJob: success
 {code}
 This is a bug for sure



[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-04-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637247#comment-13637247
 ] 

lufeng commented on NUTCH-1562:
---

Hi Julien, if someone defines scoring.filter.order with, for example, the opic 
and depth filters, but these filters are not included in the plugin.includes 
property (perhaps by oversight), it will throw an exception like this: 

{code:java}
java.lang.NullPointerException
    at org.apache.nutch.scoring.ScoringFilters.injectedScore(ScoringFilters.java:112)
    at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:164)
    at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:63)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-04-20 21:19:10,983 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)

{code}

Should we consider this situation or not? 
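
One possible way to handle this situation, as a minimal sketch only: check each configured name against the filters that were actually loaded and fail with a clear message instead of a NullPointerException. The class and method names here (OrderedFilterResolver, resolve) and the org.example filter names are hypothetical and are not part of the Nutch ScoringFilters code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch code: resolve an ordered list of filter
// class names against the filters that the plugin system actually loaded.
class OrderedFilterResolver {

  // Returns the loaded filters in the configured order. A name that is
  // configured but not loaded raises a descriptive error instead of
  // leaving a null entry that fails later with a NullPointerException.
  static <T> List<T> resolve(String orderProperty, Map<String, T> loaded) {
    List<T> ordered = new ArrayList<>();
    if (orderProperty == null || orderProperty.trim().isEmpty()) {
      ordered.addAll(loaded.values()); // no explicit order configured
      return ordered;
    }
    for (String name : orderProperty.trim().split("\\s+")) {
      T filter = loaded.get(name);
      if (filter == null) {
        throw new IllegalStateException("Scoring filter '" + name
            + "' is listed in scoring.filter.order but is not activated"
            + " via plugin.includes");
      }
      ordered.add(filter);
    }
    return ordered;
  }

  public static void main(String[] args) {
    Map<String, String> loaded = new LinkedHashMap<>();
    loaded.put("org.example.OpicFilter", "opic"); // placeholder names
    System.out.println(resolve("org.example.OpicFilter", loaded));
    try {
      resolve("org.example.OpicFilter org.example.DepthFilter", loaded);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Whether a missing filter should be an error or should only be logged and skipped is a design choice; the sketch fails fast so that a configuration mistake is visible at startup.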

 Order of execution for scoring filters
 --

 Key: NUTCH-1562
 URL: https://issues.apache.org/jira/browse/NUTCH-1562
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.6, 2.1
Reporter: Julien Nioche
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1562-trunk.patch


 The documentation in nutch-default.xml states that :
 {quote}
 <property>
   <name>scoring.filter.order</name>
   <value></value>
   <description>The order in which scoring filters are applied.
   This may be left empty (in which case all available scoring
   filters will be applied in the order defined in plugin-includes
   and plugin-excludes), or a space separated list of implementation
   classes.
   </description>
 </property>
 {quote}
 however if no order is specified the filters are ordered randomly and not in 
 the order defined in plugin-includes.
 The other *order parameters (e.g. urlfilter.order) have a different 
 documentation and are loaded and applied in system defined order which 
 corresponds to what the code does.
 The patch attached is for 1.x and puts the code in accordance with the 
 documentation by ordering the filters according to the order of the plugins, 
 which gives users more control without having to specify the classes 
 explicitly in scoring.filter.order.
 We could extend the same idea to the other *order params.



[jira] [Assigned] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-04-20 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1563:
-

Assignee: lufeng

 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to read some fields of WebPage, it always returns 
 null.



[jira] [Created] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-04-18 Thread lufeng (JIRA)
lufeng created NUTCH-1563:
-

 Summary: FetchSchedule#getFields is never used by GeneraterJob
 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Priority: Minor
 Fix For: 2.2


The getFields method in FetchSchedule is never used, so if a user extends 
FetchSchedule and wants to read some fields of WebPage, it always returns null.
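
The idea behind a fix can be sketched as follows, under stated assumptions: the job that queries the web table merges the fields requested by the schedule into its own field set. The types below (FieldCollection, Field, Schedule) are simplified stand-ins invented for illustration; they are not the real WebPage.Field or FetchSchedule types, and this is not the attached patch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch with stand-in types, not the Nutch classes.
class FieldCollection {

  // Stand-in for the set of storable page fields.
  enum Field { FETCH_TIME, SCORE, STATUS, MODIFIED_TIME }

  // Stand-in for a schedule that may need extra fields to be loaded.
  interface Schedule {
    Collection<Field> getFields();
  }

  // Union of the fields the job needs and the fields the schedule asks for.
  // If the schedule's fields are never merged in, they are never loaded.
  static Set<Field> fieldsToLoad(Collection<Field> jobFields, Schedule schedule) {
    Set<Field> fields = EnumSet.noneOf(Field.class);
    fields.addAll(jobFields);
    Collection<Field> extra = schedule.getFields();
    if (extra != null) {
      fields.addAll(extra);
    }
    return fields;
  }

  public static void main(String[] args) {
    Schedule custom = () -> Arrays.asList(Field.MODIFIED_TIME);
    System.out.println(fieldsToLoad(Arrays.asList(Field.FETCH_TIME, Field.SCORE), custom));
  }
}
```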



[jira] [Assigned] (NUTCH-1555) Move to commons-cli for command line parsing

2013-04-16 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1555:
-

Assignee: lufeng

 Move to commons-cli for command line parsing 
 -

 Key: NUTCH-1555
 URL: https://issues.apache.org/jira/browse/NUTCH-1555
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
 Fix For: 2.2


 I just accidentally passed in the following argument to parser job
 {code}
 law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
 updatedb
 ParserJob: starting
 ParserJob: resuming:  false
 ParserJob: forced reparse:false
 ParserJob: batchId:   updatedb
 ParserJob: success
 {code}
 This is a bug for sure



[jira] [Commented] (NUTCH-1555) bug in 2.x ParserJob command line parsing

2013-04-10 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13627917#comment-13627917
 ] 

lufeng commented on NUTCH-1555:
---

Hi Lewis, yes, as you said, we can choose an established CLI framework to 
enforce more checking. With a CLI framework, the command might look 
like this: 

{code:java}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
-batchId updatedb
ParserJob: starting
ParserJob: resuming:false
ParserJob: forced reparse:  false
ParserJob: batchId: updatedb
ParserJob: success
{code}

We cannot guarantee that the parameter values a user passes are all correct. 
Perhaps the fastest way to fix this bug is to add -batchId to the parse command. 
But using a CLI framework is a good idea; it lets us parse command line options 
more easily. 

I am +1 to port all command line parsing to a CLI framework. 
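
To illustrate the kind of checking meant here, a minimal sketch using only the JDK: named options are required, and a stray positional argument is rejected rather than silently taken as the batchId. This is not the commons-cli API and not the ParserJob code; the class name StrictArgs and the sample batchId value are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch using only the JDK; it is not the commons-cli API
// and not the ParserJob implementation.
class StrictArgs {

  // Parses "-name value" pairs and rejects anything else, so a stray
  // positional argument such as "updatedb" is reported instead of being
  // silently accepted.
  static Map<String, String> parse(String[] args) {
    Map<String, String> options = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (!arg.startsWith("-") || arg.length() == 1) {
        throw new IllegalArgumentException("Unexpected argument: " + arg);
      }
      if (i + 1 >= args.length) {
        throw new IllegalArgumentException("Missing value for " + arg);
      }
      options.put(arg.substring(1), args[++i]); // consume the value as well
    }
    return options;
  }

  public static void main(String[] args) {
    System.out.println(parse(new String[] {"-batchId", "1365412345-1234"}));
    try {
      parse(new String[] {"updatedb"});
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The sketch only handles options that take a value; boolean flags would need an extra rule, which is one reason an established CLI library is attractive.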

 bug in 2.x ParserJob command line parsing 
 --

 Key: NUTCH-1555
 URL: https://issues.apache.org/jira/browse/NUTCH-1555
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2


 I just accidentally passed in the following argument to parser job
 {code}
 law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
 updatedb
 ParserJob: starting
 ParserJob: resuming:  false
 ParserJob: forced reparse:false
 ParserJob: batchId:   updatedb
 ParserJob: success
 {code}
 This is a bug for sure



[jira] [Commented] (NUTCH-1555) bug in 2.x ParserJob command line parsing

2013-04-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13625432#comment-13625432
 ] 

lufeng commented on NUTCH-1555:
---

Hi Lewis, as you said, FetcherJob has this bug too. Running the command 
gives this result:

{code:java} 
lemo@debian:~/Workspace/java/apache-workspace/nutch2.x-svn/runtime/local$ 
bin/nutch fetch updatedb
FetcherJob: starting
FetcherJob: batchId: updatedb
Fetcher: Your 'http.agent.name' value should be listed first in 
'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
{code}

This happens because the type of batchId is a string. 

 bug in 2.x ParserJob command line parsing 
 --

 Key: NUTCH-1555
 URL: https://issues.apache.org/jira/browse/NUTCH-1555
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2


 I just accidentally passed in the following argument to parser job
 {code}
 law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
 updatedb
 ParserJob: starting
 ParserJob: resuming:  false
 ParserJob: forced reparse:false
 ParserJob: batchId:   updatedb
 ParserJob: success
 {code}
 This is a bug for sure



[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-04-06 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1545:
--

Attachment: NUTCH-1545-v2.patch

1. Removed any concept of crawldb and segments from the bin/crawl script.
2. Fixed capturing of the batchId in the bin/crawl script by adding an argument 
to the GeneratorJob class. It will get a batchId if necessary.

Any comments, please.
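
As a rough illustration of the approach, a driver can create one batchId up front and hand the same value to every later step. The sketch below assumes an id made of a timestamp plus a random number; that format, the class name BatchIds, and the method newBatchId are assumptions for illustration and are not taken from the patch.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: build one batch identifier that a crawl driver
// could pass to each subsequent step of the same round.
class BatchIds {

  // The "<epoch seconds>-<random>" format is an assumption made for this
  // sketch, chosen so that ids from different rounds are unlikely to clash.
  static String newBatchId(long epochSeconds, int random) {
    return epochSeconds + "-" + random;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis() / 1000L;
    int random = ThreadLocalRandom.current().nextInt(32768);
    String batchId = newBatchId(now, random);
    // The same id would then be given to generate, fetch, parse and update.
    System.out.println("batchId for this round: " + batchId);
  }
}
```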

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Resolved] (NUTCH-1547) BasicIndexingFilter - Problem to index full title

2013-03-28 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1547.
---

Resolution: Fixed

 BasicIndexingFilter - Problem to index full title
 -

 Key: NUTCH-1547
 URL: https://issues.apache.org/jira/browse/NUTCH-1547
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I have faced this issue when trying to index the entire title, just like the 
 content, configuring its value on nutch-default.xml to -1 
 (indexer.max.title.length). I think the behavior should be the same as the 
 content.
 If you would like to fix it, just replace the line number 90:
 if (title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 by this one:
 if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 Stack Trace:
 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
   at java.lang.String.substring(String.java:1937)
   at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
 Cheers.
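
 The guard proposed in the description can be sketched in isolation as follows. This is a minimal illustration, not the BasicIndexingFilter source: the class name TitleTruncationSketch and the method truncateTitle are hypothetical, and the only behavior assumed is the one stated in the report, namely that a limit of -1 means "do not truncate".

```java
// Hypothetical stand-alone sketch; not the actual BasicIndexingFilter code.
final class TitleTruncationSketch {

    /**
     * Returns the title cut to at most maxTitleLength characters.
     * A negative limit (for example -1) is treated as "no limit",
     * which avoids calling substring(0, -1).
     */
    static String truncateTitle(String title, int maxTitleLength) {
        if (maxTitleLength > -1 && title.length() > maxTitleLength) {
            return title.substring(0, maxTitleLength);
        }
        return title;
    }

    public static void main(String[] args) {
        // A positive limit shortens the title.
        System.out.println(truncateTitle("Apache Nutch", 6));
        // A limit of -1 leaves the title untouched instead of throwing.
        System.out.println(truncateTitle("Apache Nutch", -1));
    }
}
```

 Without the first half of the condition, a limit of -1 would reach substring(0, -1), which matches the StringIndexOutOfBoundsException in the stack trace.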



[jira] [Commented] (NUTCH-1547) BasicIndexingFilter - Problem to index full title

2013-03-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616227#comment-13616227
 ] 

lufeng commented on NUTCH-1547:
---

Feng committed revision 1462078 to trunk and revision 1462079 to 2.x.


 BasicIndexingFilter - Problem to index full title
 -

 Key: NUTCH-1547
 URL: https://issues.apache.org/jira/browse/NUTCH-1547
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I have faced this issue when trying to index the entire title, just like the 
 content, configuring its value on nutch-default.xml to -1 
 (indexer.max.title.length). I think the behavior should be the same as the 
 content.
 If you would like to fix it, just replace the line number 90:
 if (title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 by this one:
 if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 Stack Trace:
 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
   at java.lang.String.substring(String.java:1937)
   at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
 Cheers.



[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616250#comment-13616250
 ] 

lufeng commented on NUTCH-1538:
---

Yes. However, we cannot guarantee that plugins written by users will not need 
the corresponding field values in the WebPage class.

 tuning of loaded fields during fetcherJob start-up
 --

 Key: NUTCH-1538
 URL: https://issues.apache.org/jira/browse/NUTCH-1538
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.1
 Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / 
 gora-core 0.2.1 
 running fetch with parse=true
Reporter: Roland von Herget
 Attachments: NUTCH-1538-FetcherJob-v1.patch


 The main problem is that Nutch loads nearly every row & column from the DB during 
 start-up of a fetcherJob when fetcher.parse=true.
 A parserJob needs e.g. the CONTENT field from the DB in order to parse.
 The fetcherJob adds all fields of the parserJob to its needed fields if 
 running with fetcher.parse=true. [FetcherJob.getFields()]
 If the Nutch configuration saves all fetched data to the DB 
 (fetcher.store.content=true) you'll end up loading GBs of unused content 
 during fetcherJob start-up.
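
 The field selection described above can be illustrated with a small sketch. It models only the reported behavior, not the attached patch or the real FetcherJob.getFields(): the enum Field, its members other than CONTENT, and the method fieldsToLoad are hypothetical names chosen for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the reported behavior; not the Nutch FetcherJob code.
final class FetcherFieldsSketch {

    // Stand-in for the WebPage fields; only CONTENT is named in the report.
    enum Field { BASE_URL, STATUS, FETCH_TIME, CONTENT, TEXT, TITLE }

    // Fields the fetcher itself is assumed to need.
    static final Set<Field> FETCHER_FIELDS =
            EnumSet.of(Field.BASE_URL, Field.STATUS, Field.FETCH_TIME);

    // Fields a parser is assumed to need, including the large CONTENT field.
    static final Set<Field> PARSER_FIELDS =
            EnumSet.of(Field.CONTENT, Field.TEXT, Field.TITLE);

    /** Union of fetcher fields and, when parsing while fetching, parser fields. */
    static Set<Field> fieldsToLoad(boolean parseWhileFetching) {
        EnumSet<Field> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parseWhileFetching) {
            fields.addAll(PARSER_FIELDS);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fieldsToLoad(false));
        System.out.println(fieldsToLoad(true));
    }
}
```

 In this model, enabling parsing during fetch pulls CONTENT into the start-up query even though newly generated rows have nothing to parse yet, which is the cost the report describes.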



[jira] [Updated] (NUTCH-1547) BasicIndexingFilter - Problem to index full title

2013-03-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1547:
--

Attachment: NUTCH-1547-2x.patch

Added a patch for Nutch 2.x.

 BasicIndexingFilter - Problem to index full title
 -

 Key: NUTCH-1547
 URL: https://issues.apache.org/jira/browse/NUTCH-1547
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I have faced this issue when trying to index the entire title, just like the 
 content, configuring its value on nutch-default.xml to -1 
 (indexer.max.title.length). I think the behavior should be the same as the 
 content.
 If you would like to fix it, just replace the line number 90:
 if (title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 by this one:
 if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 Stack Trace:
 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
   at java.lang.String.substring(String.java:1937)
   at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
 Cheers.



[jira] [Commented] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2013-03-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615360#comment-13615360
 ] 

lufeng commented on NUTCH-1389:
---

+1 Sebastian

 parsechecker and indexchecker to report truncated content
 -

 Key: NUTCH-1389
 URL: https://issues.apache.org/jira/browse/NUTCH-1389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch


 ParserChecker and IndexingFiltersChecker should report when a document is 
 truncated due to {http,file,ftp}.content.limit.
 Truncated content may cause text and metadata extraction to fail for PDF and 
 other binary document formats.
 A hint that truncation (and not a broken plugin) is the possible reason would 
 be useful.
 See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
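
 One way such a truncation hint could be computed is sketched below. It assumes that truncation is detected by comparing the length declared in the Content-Length header with the number of bytes actually stored; the message does not say how ParseSegment.isTruncated works, so the class TruncationCheckSketch and its method are hypothetical and stand in for it only loosely.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the Nutch ParseSegment implementation.
final class TruncationCheckSketch {

    /**
     * Reports whether fewer bytes were stored than the server declared.
     * A missing or unparsable header is treated as "cannot tell", i.e. false.
     */
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return false;
        }
        long declaredLength;
        try {
            declaredLength = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return false;
        }
        return content.length < declaredLength;
    }

    public static void main(String[] args) {
        byte[] stored = "partial".getBytes(StandardCharsets.UTF_8);
        if (isTruncated("65536", stored)) {
            System.out.println("Content may be truncated; check the content limit setting.");
        }
    }
}
```

 A checker could print a message like the one in main before running the parser, so that a failed PDF extraction is not mistaken for a broken plugin.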


