[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299
 ] 

lufeng commented on NUTCH-1854:
---

If we set "fetcher.store.content=false" and "fetcher.parse=false", the 
"bin/nutch parse" command will throw an exception when it checks whether the 
input content directory exists. I think the reason we need this parameter is 
that sometimes we set "fetcher.parse" to true and do not want to store the 
content, because of a slow disk or limited disk space. So I think we can remove 
the "fetcher.store.content" parameter and simply not store the page content 
when "fetcher.parse=true".

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> 
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching is
>   finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.
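
The check suggested in the description could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the segment is taken to be a directory on the local filesystem, a parsed segment is taken to contain a crawl_parse subdirectory (as the description implies), and the class and method names are hypothetical, not part of Nutch or of the crawl script.

```java
import java.io.File;

/** Hypothetical helper, not part of Nutch: decides whether a segment still needs parsing. */
final class SegmentParseCheck {

  /** Returns true if the segment directory already contains a crawl_parse subdirectory. */
  static boolean isAlreadyParsed(File segmentDir) {
    return new File(segmentDir, "crawl_parse").isDirectory();
  }

  public static void main(String[] args) {
    // Assumption: the segment path is passed as the first argument.
    File segment = new File(args.length > 0 ? args[0] : ".");
    if (isAlreadyParsed(segment)) {
      System.out.println("Skipping parse, segment already parsed: " + segment);
    } else {
      System.out.println("Segment needs parsing: " + segment);
    }
  }
}
```

A caller would run the separate parse step only when isAlreadyParsed returns false, which avoids the "Segment already parsed!" exception for a parsing fetcher.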



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-10 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374
 ] 

lufeng edited comment on NUTCH-1939 at 2/11/15 2:16 AM:


I think that's correct. +1


was (Author: amuseme.lu):
Hi Sebastian

One question: how do you use the FetchItem returned by the "queueRedirect" 
method? I can't find any code that uses the returned object. I think the 
"queueRedirect" method has already added the redirect URL back to the fetch queue.

> Fetcher fails to follow redirects
> -
>
> Key: NUTCH-1939
> URL: https://issues.apache.org/jira/browse/NUTCH-1939
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.9
>Reporter: Sebastian Nagel
> Fix For: 1.10
>
> Attachments: NUTCH-1939.patch
>
>
> As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with 
> http.redirect.max > 0 Fetcher does not follow redirects.





[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-10 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374
 ] 

lufeng commented on NUTCH-1939:
---

Hi Sebastian

One question: how do you use the FetchItem returned by the "queueRedirect" 
method? I can't find any code that uses the returned object. I think the 
"queueRedirect" method has already added the redirect URL back to the fetch queue.

> Fetcher fails to follow redirects
> -
>
> Key: NUTCH-1939
> URL: https://issues.apache.org/jira/browse/NUTCH-1939
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.9
>Reporter: Sebastian Nagel
> Fix For: 1.10
>
> Attachments: NUTCH-1939.patch
>
>
> As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with 
> http.redirect.max > 0 Fetcher does not follow redirects.





[jira] [Commented] (NUTCH-1829) Generator : unable to distinguish real errors

2014-08-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110193#comment-14110193
 ] 

lufeng commented on NUTCH-1829:
---

Yes, I think we should distinguish the different results by using different 
return codes, so that the next action can be chosen according to the return code. 
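
As a rough sketch of that idea, distinct results can be mapped to distinct codes and the caller can branch on them. The enum, the numeric values and the action names below are hypothetical and chosen only for illustration; they are not the codes used by Nutch's Generator.

```java
/** Hypothetical exit codes; the values are illustrative, not what Nutch's Generator uses. */
enum GenerateResult {
  SUCCESS(0),
  NO_NEW_SEGMENT(1),
  ERROR(-1);

  final int exitCode;

  GenerateResult(int exitCode) {
    this.exitCode = exitCode;
  }

  /** Maps a raw exit code back to a result; unknown codes are treated as errors. */
  static GenerateResult fromExitCode(int code) {
    for (GenerateResult r : values()) {
      if (r.exitCode == code) {
        return r;
      }
    }
    return ERROR;
  }
}

final class CrawlStep {
  /** Decides what a calling script or driver should do next. */
  static String nextAction(int exitCode) {
    switch (GenerateResult.fromExitCode(exitCode)) {
      case SUCCESS:
        return "fetch";
      case NO_NEW_SEGMENT:
        return "stop";
      default:
        return "abort";
    }
  }
}
```

With separate codes, "nothing left to generate" ends the crawl loop normally while a real failure can still abort it.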

> Generator : unable to distinguish real errors
> -
>
> Key: NUTCH-1829
> URL: https://issues.apache.org/jira/browse/NUTCH-1829
> Project: Nutch
>  Issue Type: Bug
>  Components: nutchNewbie
>Affects Versions: 1.9, 2.2.1
> Environment: Ubuntu Server 14.04, OpenJDK 7
>Reporter: Mathieu Bouchard
>
> The bin/nutch generate command is returning the same error code (-1) if there 
> is an error or no new segment to process, so there is no way to tell if the 
> error is real or not from a shell script. This problem is related to 
> NUTCH-1828.
> The problem can be fixed by modifying the following Java source file:
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?revision=1619934&view=markup
> At line 711, if there is no new segment, the generator returns -1, which is 
> the same return code returned at line 714 if there was an error.





[jira] [Commented] (NUTCH-1828) bin/crawl : incorrect handling of nutch errors

2014-08-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110177#comment-14110177
 ] 

lufeng commented on NUTCH-1828:
---

Can you provide a patch for Nutch 2.x? I found that this issue also affects 
Nutch 2.x. 

Thanks Mathieu.

> bin/crawl : incorrect handling of nutch errors
> --
>
> Key: NUTCH-1828
> URL: https://issues.apache.org/jira/browse/NUTCH-1828
> Project: Nutch
>  Issue Type: Bug
>  Components: nutchNewbie
>Affects Versions: 1.9, 2.2.1
> Environment: Ubuntu Server 14.04, OpenJDK 7
>Reporter: Mathieu Bouchard
> Attachments: apache-nutch-1.9-crawl-fix-retcode.patch
>
>
> We are using Solr with Nutch to provide a complete search engine for our 
> website.
> I created a cron job that would use Nutch to crawl and update the Solr index 
> each night. This cron job is trying to automatically correct some errors that 
> could result in a corrupt crawldb. However, it seems that the bin/crawl 
> command doesn't correctly propagate errors coming from bin/nutch.
> Here is an example from the bin/crawl script :
> $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
> if [ $? -ne 0 ]
>   then exit $?
> fi
> Even if there is an error in the nutch inject command, the crawl script 
> always returns 0. The way I understand it, the exit code returned is the 
> result of the shell test and not the result of the nutch inject command.
> To correct this, we would need to modify the script with something like :
> $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
> RETCODE=$?
> if [ $RETCODE -ne 0 ]
>   then exit $RETCODE
> fi





[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045525#comment-14045525
 ] 

lufeng commented on NUTCH-385:
--

Hi Julien

In the description of "fetcher.threads.per.queue", we can add that setting 
"fetcher.threads.per.queue" to a value > 1 will also cause 
"fetcher.server.delay" to be ignored. 

Another issue: I think the property name "fetcher.max.crawl.delay" is not 
consistent with "fetcher.server.delay" and "fetcher.server.min.delay". Would 
renaming it to "fetcher.server.max.delay" be more suitable?


> Improve description of thread related configuration for Fetcher
> ---
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, fetcher
>Reporter: Chris Schneider
>Assignee: Julien Nioche
> Fix For: 1.9
>
> Attachments: NUTCH-385.patch
>
>
> For some time I've been puzzled by the interaction between two parameters that 
> control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our 
> processing of the robots.txt file, and which can be limited by 
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than 
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
> continuously. In other words, it never tries to point 3 at the host, and it 
> always points a second thread at the host before the first thread finishes 
> accessing it. Since HttpBase.unblockAddr never gets called with 
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. Thus, the server delay will never be used at all. The fetcher will be 
> continuously retrieving pages from the host, often with 2 fetchers accessing 
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to 
> complete before it gets around to pointing another thread at the target host. 
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. This, in turn, will prevent any threads from accessing this host until 
> the delay is complete, even though zero threads are currently accessing the 
> host.
> I see this behavior as inconsistent. More importantly, the current 
> implementation certainly doesn't seem to answer my original question about 
> appropriate definitions for what appear to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more 
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this 
> trumped the server delay, causing the latter to be ignored completely. That 
> is certainly not the case in the current implementation, as it will wait for 
> server delay whenever the number of threads accessing a given host drops to 
> zero.





[jira] [Commented] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010889#comment-14010889
 ] 

lufeng commented on NUTCH-1785:
---

+1 elasticsearch 1.2.0 test ok. 

One question: why convert the content byte[] to a String? If one segment 
contains HTML as well as PDF or MP3 content, how should the base64 parameter 
be set? 
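
To illustrate the concern, here is a small, self-contained Java sketch (the class and method names are hypothetical, and UTF-8 is an assumed charset) showing that decoding arbitrary binary content into a String is lossy, while a Base64 round trip preserves every byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical demo class, not part of Nutch. */
final class RawContentEncoding {

  /** Decodes bytes as UTF-8 text and encodes them again; lossy for non-text input. */
  static byte[] viaString(byte[] raw) {
    return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
  }

  /** Base64 round trip; preserves every byte. */
  static byte[] viaBase64(byte[] raw) {
    String encoded = Base64.getEncoder().encodeToString(raw);
    return Base64.getDecoder().decode(encoded);
  }

  public static void main(String[] args) {
    byte[] binary = {(byte) 0xFF, (byte) 0xD8, 0x00, 0x10}; // not valid UTF-8
    System.out.println("String round trip intact: " + Arrays.equals(binary, viaString(binary)));
    System.out.println("Base64 round trip intact: " + Arrays.equals(binary, viaBase64(binary)));
  }
}
```

For a segment that mixes HTML with PDF or MP3 content, this suggests that a single per-job setting would have to choose the Base64 path if the binary documents are to survive indexing unchanged.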

> Ability to index raw content
> 
>
> Key: NUTCH-1785
> URL: https://issues.apache.org/jira/browse/NUTCH-1785
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
> NUTCH-1785-trunk.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured 
> indexing back-end. Since Content is never read, a plugin is out of the 
> question and therefore we need to force IndexJob to process Content as well.





[jira] [Closed] (NUTCH-1521) CrawlDbFilter pass null url to urlNormailzers

2014-04-16 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1521.
-

   Resolution: Fixed
Fix Version/s: (was: 2.4)
   1.9

> CrawlDbFilter pass null url to urlNormailzers
> -
>
> Key: NUTCH-1521
> URL: https://issues.apache.org/jira/browse/NUTCH-1521
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: CrawlDbFilter_v1.patch, NUTCH-1521-trunk.patch, 
> TestCrawlDbFilter.java
>
>
> urlNormalizers will get a null URL if we set CRAWLDB_PURGE_404, and it will 
> throw a NullPointerException. The WARN log will then output something like 
> "Skipping null NullPointerException".





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-04-15 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969601#comment-13969601
 ] 

lufeng commented on NUTCH-1726:
---

Hi all, is someone free to check this patch? Thanks. 

> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.
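
The difference between reading a node's own value and traversing its descendants can be sketched with the standard org.w3c.dom API. This is an illustration only: the class name, the helper methods and the sample markup used to exercise it are hypothetical, and the actual HeadingsFilter patch may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

/** Hypothetical sketch, not the HeadingsFilter code. */
final class NestedTextSketch {

  /** Recursively collects the text of a node and all of its descendants. */
  static String collectText(Node node) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      return node.getNodeValue();
    }
    StringBuilder sb = new StringBuilder();
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      sb.append(collectText(c));
    }
    return sb.toString();
  }

  /** Parses a small well-formed fragment and returns the text under its root element. */
  static String textOf(String xml) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
      return collectText(root);
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed: " + xml, e);
    }
  }
}
```

For a heading whose text sits inside nested elements, getNodeValue() on the heading element itself returns null, whereas the recursive traversal reaches the text nodes at any depth.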





[jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-09 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964219#comment-13964219
 ] 

lufeng commented on NUTCH-1752:
---

Do you mean that different ports with the same protocol and host can have 
different robots.txt files?

+1 
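
A cache key of the form "protocol:host:port" could be built roughly as below. This is a sketch under assumptions: only http and https URLs are considered, the default ports are filled in by hand, and the class name is hypothetical rather than taken from HttpRobotRulesParser.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch of a robots.txt cache key, not the Nutch implementation. */
final class RobotsCacheKey {

  /** Builds "protocol:host:port", filling in the default port for http and https. */
  static String of(String url) {
    URI uri = URI.create(url);
    String protocol = uri.getScheme().toLowerCase(Locale.ROOT);
    String host = uri.getHost().toLowerCase(Locale.ROOT);
    int port = uri.getPort();
    if (port == -1) {
      // Assumption: only http(s) URLs reach this point.
      port = protocol.equals("https") ? 443 : 80;
    }
    return protocol + ":" + host + ":" + port;
  }
}
```

With such a key, http://example.com/ and http://example.com:8080/ get separate cache entries, so each port's own robots.txt is fetched and applied.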


> cache robots.txt rules per protocol:host:port
> -
>
> Key: NUTCH-1752
> URL: https://issues.apache.org/jira/browse/NUTCH-1752
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.8, 2.2.1
>Reporter: Sebastian Nagel
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1752-v1.patch
>
>
> HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host" 
> (before NUTCH-1031 caching was per "host" only). The caching should be per 
> "protocol:host:port". In doubt, a request to a different port may deliver a 
> different {{robots.txt}}. 
> Applying robots.txt rules to a combination of host, protocol, and port is 
> common practice:
> [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not 
> mention this explicitly (could be derived from examples) but others do:
> * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and 
> port needs its own robots.txt file"
> * [Google 
> webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
>  "The directives listed in the robots.txt file apply only to the host, 
> protocol and port number where the file is hosted."





[jira] [Commented] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938867#comment-13938867
 ] 

lufeng commented on NUTCH-1733:
---

+1 pass all tests

> parse-html to support HTML5 charset definitions
> ---
>
> Key: NUTCH-1733
> URL: https://issues.apache.org/jira/browse/NUTCH-1733
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.8, 2.2.1
>Reporter: Sebastian Nagel
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, 
> charset_html5.html
>
>
> HTML 5 allows to specify the character encoding of a page per
> * {{meta charset}} declaration
> * Unicode Byte Order Mark (BOM)
> These are allowed in addition to previous HTTP/http-equiv Content-Type, see 
> [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]].
> Parse-html ignores both meta charset and BOM, falls back to the default 
> encoding (cp1252). Parse-tika sets the encoding appropriately.





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937426#comment-13937426
 ] 

lufeng commented on NUTCH-1736:
---

Hi ysc

You can fix this issue by checking the content size like this: 

{code:java}
// Clamp the chunk so the total bytes read never exceed the maximum content size.
if (http.getMaxContent() >= 0
    && (contentBytesRead + chunkLen) > http.getMaxContent()) {
  chunkLen = http.getMaxContent() - contentBytesRead;
}
{code}
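
The same clamping logic can be written as a small standalone function, which makes it easy to try out. The class and method names are hypothetical, and the extra guard against a negative result is an addition of this sketch, not part of the snippet above.

```java
/** Hypothetical standalone version of the chunk-length check. */
final class ChunkLimit {

  /**
   * Returns how many bytes of the next chunk may still be read.
   * A negative maxContent means "no limit", mirroring the >= 0 test above.
   */
  static int clamp(int maxContent, int contentBytesRead, int chunkLen) {
    if (maxContent >= 0 && contentBytesRead + chunkLen > maxContent) {
      // Added guard: never return a negative length if the limit was already exceeded.
      return Math.max(0, maxContent - contentBytesRead);
    }
    return chunkLen;
  }
}
```

For example, with a limit of 100 bytes and 90 bytes already read, a 20-byte chunk is cut down to 10 bytes.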

> Can't fetch page if http response header contains Transfer-Encoding:chunked
> ---
>
> Key: NUTCH-1736
> URL: https://issues.apache.org/jira/browse/NUTCH-1736
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
>Reporter: ysc
>Priority: Critical
> Fix For: 2.3, 1.9
>
> Attachments: nutch-2.2.1.patch, nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> fetching: 
> http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
> Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
> unzipBestEffort returned null





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937418#comment-13937418
 ] 

lufeng commented on NUTCH-1736:
---

Hi Sebastian, I think this patch is not related to NUTCH-1647; the two issues 
may just produce the same exception. NUTCH-1647 is about a URL redirection issue. 





> Can't fetch page if http response header contains Transfer-Encoding:chunked
> ---
>
> Key: NUTCH-1736
> URL: https://issues.apache.org/jira/browse/NUTCH-1736
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
>Reporter: ysc
>Priority: Critical
> Fix For: 2.3, 1.9
>
> Attachments: nutch-2.2.1.patch, nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> fetching: 
> http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
> Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
> unzipBestEffort returned null





[jira] [Comment Edited] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355
 ] 

lufeng edited comment on NUTCH-1726 at 2/24/14 2:41 PM:


Hi Markus

It seems that HeadingsFilter does not find nested nodes in my test code, but 
I cannot reproduce your test result when I use the following process to test 
our patch:

{code:java}
> svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
> cd nutch-svn2
> patch -p0 < NUTCH-1726-trunk.patch
> ant
> cd src/plugin/headings/
> ant test
{code}

everything seems ok.

Yes, you are right, maybe someone wants to ignore long headers. But do we need 
to allow setting the headings.maxlength option to -1 to disable this check? 
Maybe someone wants to disable this feature.

Feng






was (Author: amuseme.lu):
Hi Markus

It seems that HeadingsFilter does not find nested nodes in my test code, but 
I cannot reproduce your test result when I use the following process to test 
our patch:

{code:bash}
> svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
> cd nutch-svn2
> patch -p0 < NUTCH-1726-trunk.patch
> ant
> cd src/plugin/headings/
> ant test
{code}

everything seems ok.

Yes, you are right, maybe someone wants to ignore long headers. But do we need 
to allow setting the headings.maxlength option to -1 to disable this check? 
Maybe someone wants to disable this feature.

Feng





> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355
 ] 

lufeng commented on NUTCH-1726:
---

Hi Markus

It seems that HeadingsFilter does not find nested nodes in my test code, but 
I cannot reproduce your test result when I use the following process to test 
our patch:

{code:bash}
> svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
> cd nutch-svn2
> patch -p0 < NUTCH-1726-trunk.patch
> ant
> cd src/plugin/headings/
> ant test
{code}

everything seems ok.

Yes, you are right, maybe someone wants to ignore long headers. But do we need 
to allow setting the headings.maxlength option to -1 to disable this check? 
Maybe someone wants to disable this feature.

Feng





> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432
 ] 

lufeng commented on NUTCH-1726:
---

Hi Markus. 

But I didn't find any error using your newest patch. 

{code:xml}
test:
[echo] Testing plugin: headings
[junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec

BUILD SUCCESSFUL
Total time: 3 seconds
{code}

* Maybe you can truncate long headers if their size is larger than the value of 
the maxlength option, so the headings.truncate option can be removed.




> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1726:
--

Attachment: NUTCH-1726-trunk-v2.patch

Added a test case to check the HeadingsFilter patch. :)

> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-03 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861502#comment-13861502
 ] 

lufeng commented on NUTCH-1691:
---

As in the urlfilter-prefix plugin, we can move the WARN code to keep the code 
consistent. :)

> DomainBlacklist url filter does not allow -D filter file override
> -
>
> Key: NUTCH-1691
> URL: https://issues.apache.org/jira/browse/NUTCH-1691
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8, 2.4
>
> Attachments: NUTCH-1691-trunk.patch
>
>
> This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
> plugin's file attribute is always used.





[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2014-01-03 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861491#comment-13861491
 ] 

lufeng commented on NUTCH-1647:
---

Yes, but we can check this property in protocol plugins like this:

{code:java}
  Response response;
  if (conf.getInt("http.redirect.max", 3) > 0) {
    // make a request and follow redirects
    response = getResponse(u, datum, true);
  } else {
    response = getResponse(u, datum, false);
  }
{code}

So if this property is set to a value greater than 0, protocol plugins will 
follow redirects; otherwise they will not. 

> protocol-http throws unzipBestEffort returned null for some pages
> -
>
> Key: NUTCH-1647
> URL: https://issues.apache.org/jira/browse/NUTCH-1647
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Markus Jelsma
> Fix For: 1.8
>
>
> bin/nutch indexchecker 
> http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale  
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.io.IOException: unzipBestEffort returned null
> {code}
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.host = null
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.port = 8080
> 2013-10-01 13:44:55,612 INFO  http.Http - http.timeout = 12000
> 2013-10-01 13:44:55,612 INFO  http.Http - http.content.limit = 5242880
> 2013-10-01 13:44:55,612 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +http://www.openindex.io/en/webmasters/spider.html)
> 2013-10-01 13:44:55,612 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-10-01 13:44:55,613 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output
> java.io.IOException: unzipBestEffort returned null
> at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
> at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
> at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:86)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:150)
> {code}
> Haven't got a clue yet as to what the exact issue is.





[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2014-01-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859880#comment-13859880
 ] 

lufeng commented on NUTCH-1647:
---

This is caused by the returned content length being 0, so the unzipBestEffort 
method returns null.

{code:java}
content = GZIPUtils.unzipBestEffort(compressed);
{code}

{code:bash}
lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ wget 
--verbose --server-response 
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
--2014-01-01 21:47:06--  
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
Resolving www.provinciegroningen.nl (www.provinciegroningen.nl)... 194.13.8.20
Connecting to www.provinciegroningen.nl 
(www.provinciegroningen.nl)|194.13.8.20|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 301 TYPO3 RealURL redirect
  Date: Wed, 01 Jan 2014 13:47:22 GMT
  Server: Apache
  Location: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/
  Cache-Control: max-age=3600
  Expires: Wed, 01 Jan 2014 14:47:22 GMT
  Vary: Accept-Encoding
  Content-Length: 0
  Content-Type: text/html; charset=UTF-8
  Connection: Keep-Alive
  Set-Cookie: fe_typo_user=56acbb2f413742a928a94ebf51a51bcd; path=/
  Age: 0
Location: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/ 
[following]
--2014-01-01 21:47:13--  
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/
Reusing existing connection to www.provinciegroningen.nl:80.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 01 Jan 2014 13:47:22 GMT
  Server: Apache
  Cache-Control: max-age=3600
  Expires: Wed, 01 Jan 2014 14:47:22 GMT
  Vary: Accept-Encoding
  Transfer-Encoding: chunked
  Content-Type: text/html; charset=utf-8
  Connection: Keep-Alive
  Set-Cookie: fe_typo_user=29b705c75f2c6ff9cf495577efd727dd; path=/
  Age: 0
Length: unspecified [text/html]
Saving to: `rwe-centrale.2'

[ <=>   
  ] 51,728  2.92K/s   in 34s 

2014-01-01 21:47:48 (1.49 KB/s) - `rwe-centrale.2' saved [51728]
{code}

If you use the httpclient protocol plugin and enable the follow-redirects 
option, it will download the page correctly.

{code:java}
lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ 
bin/nutch indexchecker 
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
fetching: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
parsing: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
contentType: application/xhtml+xml
content :   Provincie Groningen: RWE-centrale Provincie Groningen  >  
Actueel  >  Dossiers  > RWE-centrale RWE-c
title : Provincie Groningen: RWE-centrale
host :  www.provinciegroningen.nl
tstamp :Wed Jan 01 22:03:40 CST 2014
url :   http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
{code}

But this option is always set to false in the HttpBase class.

{code:java}
  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {

    String urlString = url.toString();
    try {
      URL u = new URL(urlString);
      Response response = getResponse(u, datum, false); // make a request
{code}

So the current solution is:
1. Read that option from the configuration file and use it in the 
getProtocolOutput interface.

But for the protocol-http plugin, we need to write some code to handle URL 
redirects.
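
What such redirect handling might involve can be sketched independently of Nutch. In the sketch below, the class name is hypothetical, a Map stands in for the network (it maps a URL to the Location header its server would return), and throwing once the limit is reached is an assumption of this sketch rather than Nutch's behaviour.

```java
import java.net.URI;
import java.util.Map;

/** Hypothetical redirect-following sketch, not the protocol-http code. */
final class RedirectSketch {

  /**
   * Follows Location headers up to redirectMax times.
   * 'locations' has no entry for a URL that is a final page.
   */
  static String follow(String start, int redirectMax, Map<String, String> locations) {
    URI current = URI.create(start);
    for (int hops = 0; ; hops++) {
      String location = locations.get(current.toString());
      if (location == null) {
        return current.toString(); // not a redirect, done
      }
      if (hops >= redirectMax) {
        throw new IllegalStateException(
            "more than " + redirectMax + " redirects from " + start);
      }
      current = current.resolve(location); // the Location value may be relative
    }
  }
}
```

Applied to the page from this issue, the 301 response with an empty body would lead to the URL with the trailing slash, whose 200 response carries the actual content.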

> protocol-http throws unzipBestEffort returned null for some pages
> -
>
> Key: NUTCH-1647
> URL: https://issues.apache.org/jira/browse/NUTCH-1647
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Markus Jelsma
> Fix For: 1.8
>
>
> bin/nutch indexchecker 
> http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale  
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.io.IOException: unzipBestEffort returned null
> {code}
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.host = null
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.port = 8080
> 2013-10-01 13:44:55,612 INFO  http.Http - http.timeout = 12000
> 2013-10-01 13:44:55,612 INFO  http.Http - http.content.limit = 5242880
> 2013-10-01 13:44:55,612 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +http://www.openindex.io/en/webmasters/spider.html)
> 2013-10-01 13:44:55,612 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-10-01 13:44:55,613 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output
> java.io.IOException: unzipBestEffort returned null
> at 
> org.apache.nutch.protocol.http.api.HttpB

[jira] [Commented] (NUTCH-1671) indexchecker to add digest field

2013-11-22 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830530#comment-13830530
 ] 

lufeng commented on NUTCH-1671:
---

Yes, this field can be used by indexing filters.  +1
Another question: should we add a check after parsing the content, like this?

{code:java}
ParseResult parseResult = new ParseUtil(conf).parse(content); 

if (parseResult == null) {
  LOG.error("Problem with parse - check log");
  return (-1);
}

{code}

> indexchecker to add digest field
> 
>
> Key: NUTCH-1671
> URL: https://issues.apache.org/jira/browse/NUTCH-1671
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7, 2.2.1
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1671-2x.patch, NUTCH-1671-trunk.patch
>
>
> IndexingFiltersChecker does not add field "digest" as done by 
> IndexerMapReduce. Digest/signature could be also used by indexing filters 
> which then may fail.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1667) Updatedb always ignore batchId

2013-11-22 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830525#comment-13830525
 ] 

lufeng commented on NUTCH-1667:
---

Yes, you are right. +1

> Updatedb always ignore batchId
> --
>
> Key: NUTCH-1667
> URL: https://issues.apache.org/jira/browse/NUTCH-1667
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Nguyen Manh Tien
>Priority: Minor
> Attachments: NUTCH-1556-batchId.patch
>
>
> batchId is not set in currentJob because we set batchId after creating 
> currentJob, so in UpdateDbMapper batchId is null and will be assigned -all.
> I changed it to set batchId before creating currentJob.
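
Why the order matters can be shown with a small stand-alone sketch. It assumes, as Hadoop jobs appear to do, that the job takes its own copy of the configuration when it is created. FakeJob, the "batchId" key and both helper methods are hypothetical and only mimic that behaviour; the "-all" default is the one mentioned in the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the ordering problem. Not Nutch or Hadoop code. */
final class JobConfigOrderSketch {

  /** Stand-in for a job that copies its configuration at construction time. */
  static final class FakeJob {
    private final Map<String, String> conf;

    FakeJob(Map<String, String> conf) {
      this.conf = new HashMap<>(conf); // later changes to the caller's map are not seen
    }

    String batchId() {
      return conf.getOrDefault("batchId", "-all");
    }
  }

  /** Buggy order: the job is created first, so it never sees the batchId. */
  static String setAfterCreate(String batchId) {
    Map<String, String> conf = new HashMap<>();
    FakeJob job = new FakeJob(conf);
    conf.put("batchId", batchId);
    return job.batchId();
  }

  /** Fixed order: the batchId is in the configuration before the job copies it. */
  static String setBeforeCreate(String batchId) {
    Map<String, String> conf = new HashMap<>();
    conf.put("batchId", batchId);
    FakeJob job = new FakeJob(conf);
    return job.batchId();
  }
}
```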





[jira] [Work started] (NUTCH-1670) set same crawldb directory in mergedb parameter

2013-11-20 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1670 started by lufeng.

> set same crawldb directory in mergedb parameter
> ---
>
> Key: NUTCH-1670
> URL: https://issues.apache.org/jira/browse/NUTCH-1670
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1670.patch
>
>
> When two crawldbs are merged and the same crawldb directory is given twice as 
> a bin/nutch mergedb parameter, a data-not-found exception is thrown. 
> bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
> bin/nutch generate crawldb_t1 segment





[jira] [Updated] (NUTCH-1670) set same crawldb directory in mergedb parameter

2013-11-20 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1670:
--

Attachment: NUTCH-1670.patch

> set same crawldb directory in mergedb parameter
> ---
>
> Key: NUTCH-1670
> URL: https://issues.apache.org/jira/browse/NUTCH-1670
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1670.patch
>
>
> When two crawldbs are merged and the same crawldb directory is given twice as 
> a bin/nutch mergedb parameter, a data-not-found exception is thrown. 
> bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
> bin/nutch generate crawldb_t1 segment





[jira] [Created] (NUTCH-1670) set same crawldb directory in mergedb parameter

2013-11-20 Thread lufeng (JIRA)
lufeng created NUTCH-1670:
-

 Summary: set same crawldb directory in mergedb parameter
 Key: NUTCH-1670
 URL: https://issues.apache.org/jira/browse/NUTCH-1670
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8


When two crawldbs are merged and the same crawldb directory is given twice as 
a bin/nutch mergedb parameter, a data-not-found exception is thrown. 

bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
bin/nutch generate crawldb_t1 segment
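
One possible guard is to reject the arguments when the output directory is also one of the inputs. The sketch below is not the NUTCH-1670 patch: MergeDbArgsSketch and outputIsAlsoInput are hypothetical names, local java.nio paths are used instead of Hadoop paths, and it assumes the first mergedb argument is the output crawldb.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical argument check for a merge command. Not Nutch code. */
final class MergeDbArgsSketch {

  /** Returns true if the output directory is also listed as an input. */
  static boolean outputIsAlsoInput(String output, List<String> inputs) {
    // Normalize so that "crawldb_t1" and "./crawldb_t1" compare as equal.
    Path out = Paths.get(output).toAbsolutePath().normalize();
    for (String in : inputs) {
      if (Paths.get(in).toAbsolutePath().normalize().equals(out)) {
        return true;
      }
    }
    return false;
  }
}
```

For the command above, the output crawldb_t1 with inputs crawldb_t1 and crawldb_2 would be flagged, so the tool could print a usage error instead of failing later.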







[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-11-04 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812840#comment-13812840
 ] 

lufeng commented on NUTCH-1651:
---

Hi Lewis
Yes, the patch is OK, and this is a way to set ModifiedTime. +1 

> modifiedTime and prevmodifiedTime never set 
> 
>
> Key: NUTCH-1651
> URL: https://issues.apache.org/jira/browse/NUTCH-1651
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1651.patch
>
>
> modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
> always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime 
> is set only once in the beginning by zero-control of AdaptiveFetchScheduler.
> But this is not sufficient since modifiedTime needs to be updated whenever 
> last modified time is available. We corrected this with a patch.
> Also we noticed that prevModifiedTime is not written to database and we 
> corrected that too.
> With this patch, whenever lastModifiedTime is available, we do two things. 
> First we set modifiedTime in the Page object to prevModifiedTime. After that 
> we set lastModifiedTime to modifiedTime.





[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-10-30 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809081#comment-13809081
 ] 

lufeng commented on NUTCH-1651:
---

Hi Talat

Yes, you are right, lastModified is a fetch parameter, but it can also be set 
by parser plugins, because this attribute can also be defined by parsers. It's an 
attribute of WebPage. 

I can't find any code in Nutch 2.x that sets the ModifiedTime in WebPage, nor 
any in Nutch 1.x. Very strange.



> modifiedTime and prevmodifiedTime never set 
> 
>
> Key: NUTCH-1651
> URL: https://issues.apache.org/jira/browse/NUTCH-1651
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1651.patch
>
>
> modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
> always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime 
> is set only once in the beginning by zero-control of AdaptiveFetchScheduler.
> But this is not sufficient since modifiedTime needs to be updated whenever 
> last modified time is available. We corrected this with a patch.
> Also we noticed that prevModifiedTime is not written to database and we 
> corrected that too.
> With this patch, whenever lastModifiedTime is available, we do two things. 
> First we set modifiedTime in the Page object to prevModifiedTime. After that 
> we set lastModifiedTime to modifiedTime.





[jira] [Commented] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified

2013-10-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808091#comment-13808091
 ] 

lufeng commented on NUTCH-1564:
---

Yes, this problem is caused by the range of the interval value. Maybe this delta 
also needs to be limited by a max value, such as MAX_INTERVAL:

{code:java}
  if (SYNC_DELTA) {
    // try to synchronize with the time of change
    long delta = (fetchTime - modifiedTime) / 1000L;
    if (delta > interval) interval = delta;
    if (delta < MIN_INTERVAL) {
      delta = MIN_INTERVAL;
    } else if (delta > MAX_INTERVAL) {
      delta = MAX_INTERVAL;
    }
    refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
  }
  if (interval < MIN_INTERVAL) {
    interval = MIN_INTERVAL;
  } else if (interval > MAX_INTERVAL) {
    interval = MAX_INTERVAL;
  }
  ...
  datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}

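So the final fetch time is fetchTime + fetchInterval - delta * SYNC_DELTA_RATE 
= fetchTime + 7 days - 7 days * 0.3 = fetchTime + 4.9 days (with both the 
interval and delta capped at MAX_INTERVAL = 7 days).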
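
The arithmetic can be checked with a small stand-alone sketch. It is not the AdaptiveFetchSchedule code: SyncDeltaSketch and nextFetchOffsetMs are hypothetical names, the constants are taken from the configuration quoted in the issue (min 1 day, max 7 days, rate 0.3), and delta is rounded to exactly 30 days as in the issue description.

```java
/** Hypothetical sketch of the sync_delta arithmetic. Not Nutch code. */
final class SyncDeltaSketch {
  static final long DAY = 86_400L;            // seconds per day
  static final long MIN_INTERVAL = 1 * DAY;   // db.fetch.schedule.adaptive.min_interval
  static final long MAX_INTERVAL = 7 * DAY;   // db.fetch.schedule.adaptive.max_interval
  static final double SYNC_DELTA_RATE = 0.3;  // db.fetch.schedule.adaptive.sync_delta_rate

  /**
   * Returns the next fetch time as an offset in milliseconds from fetchTime.
   * deltaSec is (fetchTime - modifiedTime) in seconds; clampDelta selects the
   * proposed variant that limits delta to [MIN_INTERVAL, MAX_INTERVAL].
   */
  static long nextFetchOffsetMs(long deltaSec, long intervalSec, boolean clampDelta) {
    long delta = deltaSec;
    long interval = intervalSec;
    if (delta > interval) interval = delta;
    if (clampDelta) {
      delta = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, delta));
    }
    // refTime - fetchTime: how far the reference time is shifted into the past
    long refOffsetMs = -Math.round(delta * SYNC_DELTA_RATE * 1000);
    interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
    return refOffsetMs + Math.round(interval * 1000.0);
  }
}
```

Without the clamp a delta of 30 days gives an offset of -2 days (the next fetch lies in the past); with the clamp the offset becomes +4.9 days.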

Or could we limit the interval after calling the setFetchTime method?

{code:java}
 if (SYNC_DELTA) {
// try to synchronize with the time of change
long delta = (fetchTime - modifiedTime) / 1000L; 
if (delta > interval) interval = delta;
refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000); 
  }
}
datum.setFetchTime(refTime + Math.round(interval * 1000.0));  
if (interval < MIN_INTERVAL) {
  interval = MIN_INTERVAL;
} else if (interval > MAX_INTERVAL) {
  interval = MAX_INTERVAL;
}
datum.setFetchInterval(interval);
datum.setModifiedTime(modifiedTime);
{code}

> AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not 
> modified
> -
>
> Key: NUTCH-1564
> URL: https://issues.apache.org/jira/browse/NUTCH-1564
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.6, 2.1
>Reporter: Sebastian Nagel
>Priority: Critical
>
> In a continuous crawl with adaptive fetch scheduling, documents not modified 
> for a longer time may be fetched in every cycle.
> A continuous crawl is run daily with 3 cycles and the following scheduling 
> intervals (freshness matters):
> {code}
> db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
> db.fetch.schedule.adaptive.sync_delta   = true (default)
> db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
> db.fetch.interval.default   = 172800 (2 days)
> db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
> db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
> db.fetch.interval.max   = 604800 (7 days)
> {code}
> On Apr 18 a URL is generated and fetched (from segment dump):
> {code}
> Crawl Generate::
> Status: 2 (db_fetched)
> Fetch time: Mon Apr 15 19:43:22 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> Crawl Fetch::
> Status: 33 (fetch_success)
> Fetch time: Thu Apr 18 01:23:51 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> {code}
> Running CrawlDb update results in a next fetch time in the past (which forces 
> an immediate refetch in the next cycle):
> {code}
> Status: 6 (db_notmodified)
> Fetch time: Tue Apr 16 01:37:00 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> {code}
> This behavior is caused by the sync_delta calculation in 
> AdaptiveFetchSchedule:
> {code}
>   if (SYNC_DELTA) {
> // try to synchronize with the time of change
> long delta = (fetchTime - modifiedTime) / 1000L;
> if (delta > interval) interval = delta;
> refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
>   }
>   if (interval < MIN_INTERVAL) {
> interval = MIN_INTERVAL;
>   } else if (interval > MAX_INTERVAL) {
> interval = MAX_INTERVAL;
>   }
> ...
> datum.setFetchTime(refTime + Math.round(interval * 1000.0));
> {code}
> {{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the 
> past ({{delta}} * 0.3). After adding {{interval}} (adjusted to 
> {{MAX_INTERVAL}} = 7 days) to {{refTime}} the next fetch "should" take place 
> 2 days in the past (Apr 16).
> According to the 
> [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html]
>  (if understood right), there are two aims of the sync_delta if we know that a 
> document hasn't been modified for long:
> * increase the fetch interval immediately (not step by step)
> * because we expect the document to be changed within the adaptive interval 
> (but it hasn't), we shift the "reference time", i.e

[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-10-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808045#comment-13808045
 ] 

lufeng commented on NUTCH-1651:
---

Hi Talat

But I don't think reading the last-modified time from the header belongs here. A 
user may want to detect whether an HTML page was modified in a parser plugin, 
based on the content of that URL rather than on the metadata in the HTTP 
headers, even if the value of "Last-Modified" in the headers has changed.

{code:java}
+Utf8 lastModified = page.getFromHeaders(new Utf8("Last-Modified"));
+if ( lastModified != null ){
+  try {
+modifiedTime = HttpDateFormat.toLong(lastModified.toString());
+prevModifiedTime = page.getModifiedTime();
+  } catch (Exception e) {
+  }
+}
{code}

Maybe the appropriate way is to let a user-defined parser plugin set the value 
of the modified time, rather than setting it in the DbUpdateReducer class.
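
For reference, what the quoted hunk does can be written as a stand-alone sketch. It is not the NUTCH-1651 patch: ModifiedTimeSketch and update are hypothetical names, java.time is used in place of Nutch's HttpDateFormat, and an unparsable header is reported to the caller instead of being swallowed by an empty catch block.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical sketch of the modifiedTime update. Not Nutch code. */
final class ModifiedTimeSketch {
  long modifiedTime;     // epoch milliseconds, 0 means "unknown"
  long prevModifiedTime; // value of modifiedTime before the last update

  /** Returns true if the header could be parsed and the times were updated. */
  boolean update(String lastModifiedHeader) {
    if (lastModifiedHeader == null) {
      return false;
    }
    long parsed;
    try {
      parsed = ZonedDateTime
          .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
          .toInstant()
          .toEpochMilli();
    } catch (DateTimeParseException e) {
      return false; // leave both fields untouched on a malformed header
    }
    prevModifiedTime = modifiedTime; // keep the old value first
    modifiedTime = parsed;           // then store the new one
    return true;
  }
}
```

The same two assignments could equally be made from a parser plugin, which is the alternative suggested above.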

> modifiedTime and prevmodifiedTime never set 
> 
>
> Key: NUTCH-1651
> URL: https://issues.apache.org/jira/browse/NUTCH-1651
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1651.patch
>
>
> modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
> always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime 
> is set only once in the beginning by zero-control of AdaptiveFetchScheduler.
> But this is not sufficient since modifiedTime needs to be updated whenever 
> last modified time is available. We corrected this with a patch.
> Also we noticed that prevModifiedTime is not written to database and we 
> corrected that too.
> With this patch, whenever lastModifiedTime is available, we do two things. 
> First we set modifiedTime in the Page object to prevModifiedTime. After that 
> we set lastModifiedTime to modifiedTime.





[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2013-10-28 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1645:
--

Attachment: NUTCH-1645-v3.patch

1. add a check so that an implementation which reaches a lower number of misses 
would cause the test to fail
2. improve the code style 

Yes, you are right, this unit test only checks for the equality of some "key 
statistics", as you said. But the problem is how to write a test case that 
verifies the correctness of algorithms in Nutch such as the 
AdaptiveFetchSchedule class and finds the bug that you pointed out in 
NUTCH-1564. Could you give me some suggestions? I will check NUTCH-1564 and 
hope to find a solution to this issue.

Thanks Sebastian

> Junit Test Case for Adaptive Fetch Schedule class
> -
>
> Key: NUTCH-1645
> URL: https://issues.apache.org/jira/browse/NUTCH-1645
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch, 
> NUTCH-1645-v3.patch
>
>
> Currently there is no test case for Adaptive Fetch Schedule. This adds a 
> JUnit test for it. 





[jira] [Commented] (NUTCH-1650) Adaptive Fetch Scheduler interval Wrong Set

2013-10-06 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787664#comment-13787664
 ] 

lufeng commented on NUTCH-1650:
---

Yes, this code in Nutch 1.x is correct. +1

> Adaptive Fetch Scheduler interval Wrong Set
> ---
>
> Key: NUTCH-1650
> URL: https://issues.apache.org/jira/browse/NUTCH-1650
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
>Priority: Minor
>  Labels: scheduler
> Fix For: 2.3
>
> Attachments: NUTCH-1650.patch
>
>
> After the interval was calculated, it was set without being checked against 
> the max and min values. Moreover, if sync_delta is true, the interval was set 
> before the changes. This patch fixes this.





[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2013-10-06 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1645:
--

Attachment: NUTCH-1645-v2.patch

Added two test cases: one uses the default parameters and the other runs with 
sync delta disabled. 

Thanks Yasin, you can add another test case with some parameters changed.  

> Junit Test Case for Adaptive Fetch Schedule class
> -
>
> Key: NUTCH-1645
> URL: https://issues.apache.org/jira/browse/NUTCH-1645
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch
>
>
> Currently there is no test case for Adaptive Fetch Schedule. This adds a 
> JUnit test for it. 





[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-12 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765410#comment-13765410
 ] 

lufeng commented on NUTCH-1556:
---

oh, I'm so sorry, I already fixed this problem.

commit revision 1522566 in 2.x HEAD.

thanks Julien.

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
> NUTCH-1556-v3.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1636) Indexer to normalize and filter repr URL

2013-09-09 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761888#comment-13761888
 ] 

lufeng commented on NUTCH-1636:
---

yes, this patch can solve the issue reported by lain. +1

> Indexer to normalize and filter repr URL
> 
>
> Key: NUTCH-1636
> URL: https://issues.apache.org/jira/browse/NUTCH-1636
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 1.7
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1636-1.patch
>
>
> Indexer if used with option -normalize and/or -filter (cf. NUTCH-1300) should 
> also normalize and filter representation URLs. Otherwise URLs which are 
> target of a redirect, and have repr URL set (see URLUtil.chooseRepr) may show 
> up in index with an undesirable URL.



[jira] [Resolved] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-05 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1556.
---

Resolution: Fixed

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
> NUTCH-1556-v3.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-05 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759123#comment-13759123
 ] 

lufeng commented on NUTCH-1556:
---

Committed revision 1520332 in 2.x HEAD
Thanks kaveh. 

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
> NUTCH-1556-v3.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-02 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756080#comment-13756080
 ] 

lufeng commented on NUTCH-1556:
---

I will commit this unless there are objections

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
> NUTCH-1556-v3.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752432#comment-13752432
 ] 

lufeng commented on NUTCH-1556:
---

thanks kaveh. +1

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
> NUTCH-1556-v3.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1556:
--

Attachment: NUTCH-1556-v2.patch

new patch merged with issue 1632

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Commented] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750804#comment-13750804
 ] 

lufeng commented on NUTCH-1632:
---

Hi kaveh, I'm sorry. I will close this issue and merge the two patches into 
one. Thanks.

> add batchId argument for DbUpdaterJob
> -
>
> Key: NUTCH-1632
> URL: https://issues.apache.org/jira/browse/NUTCH-1632
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 2.2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1632.patch
>
>
> add batchId argument for DbUpdaterJob, you can put the batchId to 
> DbUpdaterJob.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750803#comment-13750803
 ] 

lufeng commented on NUTCH-1556:
---

Hi Lewis, I'm sorry, I created a duplicate issue. I will merge these two patches 
into one. Could you check it out? Thanks.

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchId simultaneously. I put together a patch. it seems to be working ( it 
> does skip the rows that do not match the batchId), but I am worried if and 
> how it might affect the sorting in the reduce part. anyway check it out. 
> it also change the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Closed] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1632.
-

Resolution: Duplicate

> add batchId argument for DbUpdaterJob
> -
>
> Key: NUTCH-1632
> URL: https://issues.apache.org/jira/browse/NUTCH-1632
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 2.2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1632.patch
>
>
> add batchId argument for DbUpdaterJob, you can put the batchId to 
> DbUpdaterJob.



[jira] [Updated] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1632:
--

Attachment: NUTCH-1632.patch

> add batchId argument for DbUpdaterJob
> -
>
> Key: NUTCH-1632
> URL: https://issues.apache.org/jira/browse/NUTCH-1632
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 2.2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1632.patch
>
>
> add batchId argument for DbUpdaterJob, you can put the batchId to 
> DbUpdaterJob.



[jira] [Created] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread lufeng (JIRA)
lufeng created NUTCH-1632:
-

 Summary: add batchId argument for DbUpdaterJob
 Key: NUTCH-1632
 URL: https://issues.apache.org/jira/browse/NUTCH-1632
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 2.2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3


add batchId argument for DbUpdaterJob, you can put the batchId to DbUpdaterJob.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749663#comment-13749663
 ] 

lufeng commented on NUTCH-1619:
---

Hi Julien, I have already fixed the compilation bug, and I will pay more attention 
next time. Thanks for the reminder. 

> Writes Dmoz Description and Title information to db with snippet argument
> -
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Yasin Kılınç
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database, so that this 
> information can be used as a snippet by the indexer of the search engine we are 
> working on.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749419#comment-13749419
 ] 

lufeng commented on NUTCH-1619:
---

I'm so sorry, DataStore may not throw IOException. It has already been fixed.
Committed @revision 1517155 in 2.x HEAD

> Writes Dmoz Description and Title information to db with snippet argument
> -
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Yasin Kılınç
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database, so that this 
> information can be used as a snippet by the indexer of the search engine we are 
> working on.



[jira] [Resolved] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-24 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1619.
---

Resolution: Fixed

> Writes Dmoz Description and Title information to db with snippet argument
> -
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Yasin Kılınç
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database, so that this 
> information can be used as a snippet by the indexer of the search engine we are 
> working on.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749409#comment-13749409
 ] 

lufeng commented on NUTCH-1619:
---

Committed @revision 1517147 in 2.x HEAD
Thank you very much Talat for the patch.


> Writes Dmoz Description and Title information to db with snippet argument
> -
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Yasin Kılınç
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database, so that this 
> information can be used as a snippet by the indexer of the search engine we are 
> working on.



[jira] [Commented] (NUTCH-1631) Display Document Count Added To Solr Server

2013-08-23 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748595#comment-13748595
 ] 

lufeng commented on NUTCH-1631:
---

Good statistical methods. +1 

> Display Document Count Added To Solr Server
> ---
>
> Key: NUTCH-1631
> URL: https://issues.apache.org/jira/browse/NUTCH-1631
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Furkan KAMACI
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1631.patch
>
>
> Currently you can not see how many documents are added to Solr Server from 
> Nutch. One should be able to see how many documents are added to Solr Server 
> simultaneously (as a hadoop counter) and also total document count should be 
> logged too after all documents are added to Solr Server.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-22 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747558#comment-13747558
 ] 

lufeng commented on NUTCH-1619:
---

Thanks Talat. +1 for commit. 

> Writes Dmoz Description and Title information to db with snippet argument
> -
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Yasin Kılınç
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database, so that this 
> information can be used as a snippet by the indexer of the search engine we are 
> working on.



[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-19 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743621#comment-13743621
 ] 

lufeng commented on NUTCH-1619:
---

Hi Yasin, did you forget to close the data store? Otherwise, good.

> Writes Dmoz Description and Title information to db with snippet argument
> -
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Yasin Kılınç
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database, so that this 
> information can be used as a snippet by the indexer of the search engine we are 
> working on.



[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-14 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739731#comment-13739731
 ] 

lufeng commented on NUTCH-1294:
---

Hi Lewis, my pleasure. But what should I do for README.txt? Do you 
mean I should also change something in 
https://svn.apache.org/repos/asf/nutch/branches/2.x/README.txt? :)

> IndexClean job with solr implementation.
> 
>
> Key: NUTCH-1294
> URL: https://issues.apache.org/jira/browse/NUTCH-1294
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Dan Rosher
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, 
> NUTCH-1294-v3.patch
>
>
> I started by copying/altering the trunk version of SolrClean, though it was 
> inadequate for our needs. We needed to mark particular pages as gone even 
> though they still might be visible on the web, this implementation abstracts 
> the index cleaning process, has a Solr implementation, and adds a clean index 
> plugin extension that allows others to tailor how pages might be removed from 
> their store.



[jira] [Resolved] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-13 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1294.
---

Resolution: Fixed

> IndexClean job with solr implementation.
> 
>
> Key: NUTCH-1294
> URL: https://issues.apache.org/jira/browse/NUTCH-1294
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Dan Rosher
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, 
> NUTCH-1294-v3.patch
>
>
> I started by copying/altering the trunk version of SolrClean, though it was 
> inadequate for our needs. We needed to mark particular pages as gone even 
> though they still might be visible on the web, this implementation abstracts 
> the index cleaning process, has a Solr implementation, and adds a clean index 
> plugin extension that allows others to tailor how pages might be removed from 
> their store.



[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738361#comment-13738361
 ] 

lufeng commented on NUTCH-1294:
---

Committed @revision 1513549 in 2.x HEAD 

> IndexClean job with solr implementation.
> 
>
> Key: NUTCH-1294
> URL: https://issues.apache.org/jira/browse/NUTCH-1294
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Dan Rosher
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, 
> NUTCH-1294-v3.patch
>
>
> I started by copying/altering the trunk version of SolrClean, though it was 
> inadequate for our needs. We needed to mark particular pages as gone even 
> though they still might be visible on the web, this implementation abstracts 
> the index cleaning process, has a Solr implementation, and adds a clean index 
> plugin extension that allows others to tailor how pages might be removed from 
> their store.



[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-12 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736978#comment-13736978
 ] 

lufeng commented on NUTCH-1294:
---

Passed testing with Solr 4.2.1. +1 for commit. 

> IndexClean job with solr implementation.
> 
>
> Key: NUTCH-1294
> URL: https://issues.apache.org/jira/browse/NUTCH-1294
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Dan Rosher
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, 
> NUTCH-1294-v3.patch
>
>
> I started by copying/altering the trunk version of SolrClean, though it was 
> inadequate for our needs. We needed to mark particular pages as gone even 
> though they still might be visible on the web, this implementation abstracts 
> the index cleaning process, has a Solr implementation, and adds a clean index 
> plugin extension that allows others to tailor how pages might be removed from 
> their store.



[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2013-07-21 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714701#comment-13714701
 ] 

lufeng commented on NUTCH-1613:
---

OK. Will this cookie affect other URLs that don't need any cookie, so that 
they receive a "Bad Request" error when using httpclient? It doesn't seem 
very general, so could we add a filter that specifies a different cookie for 
each host?
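
The per-host filter suggested above could be sketched roughly as follows. This is only an illustration and not code from the attached patch: the class name HostCookieSelector, its methods, and the example host names are all hypothetical, and the sketch uses only the Java standard library rather than any Nutch or httpclient API.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: picks a cookie string per host instead of one global value. */
final class HostCookieSelector {
    private final Map<String, String> cookiesByHost = new HashMap<>();

    /** Registers a cookie string for one host; host names are stored in lower case. */
    void register(String host, String cookie) {
        cookiesByHost.put(host.toLowerCase(Locale.ROOT), cookie);
    }

    /** Returns the cookie configured for the URL's host, or empty if there is none. */
    Optional<String> cookieFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(cookiesByHost.get(host.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        HostCookieSelector selector = new HostCookieSelector();
        selector.register("secure.example.com", "XX_AL=authorization_value_goes_here");
        // Only the registered host gets a cookie; other hosts get none.
        System.out.println(selector.cookieFor("http://secure.example.com/page"));
        System.out.println(selector.cookieFor("http://other.example.org/"));
    }
}
```

With something like this, a request to a host without a registered cookie would be sent without a Cookie header, which is the behaviour the question above is asking for.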

> Timeouts in protocol-httpclient when crawling same host with >2 threads and 
> added cookie strings for both http protocols
> 
>
> Key: NUTCH-1613
> URL: https://issues.apache.org/jira/browse/NUTCH-1613
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: patch
> Fix For: 2.3
>
> Attachments: NUTCH-1613.patch
>
>
> 1.)  When using protocol-httpclient to crawl a single website (the same host) 
> I would always get a bunch of timeout errors during fetching and the pages 
> with errors would not be fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www 
> failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: 
> Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www 
> (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following 
> error: 
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting 
> for connection
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>   at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
>   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
>   at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This is because by default the connection pool manager only allows 2 
> connections per host so if more than 2 threads are used the others will tend 
> to time out waiting to get a connection.   The code previously set max 
> connections correctly but not connection per host.
> 2.) I also added at the same time simple modifications to both protocol-http 
> and protocol-httpclient to allow specifying a cookie string in the conf file 
> to include in request headers.  
> I use this to crawl site content requiring authentication - it is better for 
> me to specify the cookie string for the authentication than go through the 
> whole authentication process and specifying login info.
> The nutch-site.xml property is the following:
> <property>
>   <name>http.cookie_string</name>
>   <value>XX_AL=authorization_value_goes_here</value>
>   <description>String to use as the cookie value for HTTP requests</description>
> </property>
> Although I use it for authentication it can be used to specify any single 
> cookie string for the crawl (httpclient does support different cookies for 
> different hosts but I did not get into that).
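
The timeout described in point 1 of the quoted report can be illustrated with a small model. This is a sketch under stated assumptions, not the commons-httpclient implementation: a java.util.concurrent.Semaphore stands in for the per-host connection limit, and the class name PerHostPoolSketch and the 20 ms timeout are made up for the example.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Stand-in for a per-host connection limit; not the commons-httpclient pool. */
final class PerHostPoolSketch {
    private final Semaphore slots;

    PerHostPoolSketch(int maxConnectionsPerHost) {
        this.slots = new Semaphore(maxConnectionsPerHost);
    }

    /** Tries to take a connection slot, giving up after timeoutMillis. */
    boolean acquire(long timeoutMillis) {
        try {
            return slots.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Counts how many of the simultaneous fetchers obtain a slot before timing out. */
    static int served(int maxPerHost, int fetchers) {
        PerHostPoolSketch pool = new PerHostPoolSketch(maxPerHost);
        int ok = 0;
        for (int i = 0; i < fetchers; i++) {
            // Slots are never released here, as if every fetch were still in progress.
            if (pool.acquire(20)) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println("limit 2, 5 fetchers served: " + served(2, 5));
        System.out.println("limit 5, 5 fetchers served: " + served(5, 5));
    }
}
```

In this model, any fetcher beyond the per-host limit waits until its timeout expires, which mirrors the ConnectionPoolTimeoutException in the log above; raising the limit to the number of fetcher threads removes the wait.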



[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2013-07-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711150#comment-13711150
 ] 

lufeng commented on NUTCH-1613:
---

Will the specified cookie string affect all crawled URLs? 

> Timeouts in protocol-httpclient when crawling same host with >2 threads and 
> added cookie strings for both http protocols
> 
>
> Key: NUTCH-1613
> URL: https://issues.apache.org/jira/browse/NUTCH-1613
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: patch
> Attachments: NUTCH-1613.patch
>
>
> 1.)  When using protocol-httpclient to crawl a single website (the same host) 
> I would always get a bunch of timeout errors during fetching and the pages 
> with errors would not be fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www 
> failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: 
> Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www 
> (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following 
> error: 
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting 
> for connection
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>   at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
>   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
>   at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This is because by default the connection pool manager only allows 2 
> connections per host so if more than 2 threads are used the others will tend 
> to time out waiting to get a connection.   The code previously set max 
> connections correctly but not connection per host.
> 2.) I also added at the same time simple modifications to both protocol-http 
> and protocol-httpclient to allow specifying a cookie string in the conf file 
> to include in request headers.  
> I use this to crawl site content requiring authentication - it is better for 
> me to specify the cookie string for the authentication than go through the 
> whole authentication process and specifying login info.
> The nutch-site.xml property is the following:
> <property>
>   <name>http.cookie_string</name>
>   <value>XX_AL=authorization_value_goes_here</value>
>   <description>String to use as the cookie value for HTTP requests</description>
> </property>
> Although I use it for authentication it can be used to specify any single 
> cookie string for the crawl (httpclient does support different cookies for 
> different hosts but I did not get into that).



[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-04 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700120#comment-13700120
 ] 

lufeng commented on NUTCH-1602:
---

Committed in trunk for rev. 1499779.

Thanks Markus.

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> so I improve the Metadata format to this
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}



[jira] [Resolved] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-04 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1602.
---

Resolution: Fixed

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> so I improve the Metadata format to this
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}



[jira] [Updated] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-04 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1602:
--

Attachment: NUTCH-1602-2.patch

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> so I improve the Metadata format to this
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}



[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-04 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700082#comment-13700082
 ] 

lufeng commented on NUTCH-1602:
---

Hi Markus, this output format is only used for the *normal* dump format, not for the 
CSV format; these are two different CrawlDatum output formats. The 
normal output now looks like this, which is better than the previous one.

{code:xml}
http://www.baidu.com/   Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: 
m1=v22
m3=v3
m2=v2
m5=v5
m4=m4
_pst_=robots_denied(18), lastModified=0
m6=v6

{code}

Thanks, Julien and Tejas.
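
The one-entry-per-line layout shown in the comment above could be produced along these lines. This is a minimal sketch and not the code from NUTCH-1602.patch: the class MetadataDumpSketch is hypothetical, and a plain Map<String, String> stands in for the CrawlDatum metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical formatter; a plain map stands in for the CrawlDatum metadata. */
final class MetadataDumpSketch {
    /** Writes a "Metadata: " header followed by one key=value line per entry. */
    static String format(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("Metadata: \n");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            sb.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the dump is stable between runs.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("m1", "v22");
        metadata.put("_pst_", "robots_denied(18), lastModified=0");
        System.out.print(format(metadata));
    }
}
```

Putting each entry on its own line keeps values that themselves contain commas or "=" (such as the _pst_ value) readable without any escaping.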

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> so I improve the Metadata format to this
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}



[jira] [Created] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-03 Thread lufeng (JIRA)
lufeng created NUTCH-1602:
-

 Summary: improve the readability of metadata in readdb dump normal 
 Key: NUTCH-1602
 URL: https://issues.apache.org/jira/browse/NUTCH-1602
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 1.8


the dumped metadata format is not readable.

{code:xml}
$bin/nutch readdb crawldb/ -dump dir
http://www.baidu.com/   Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
lastModified=0m6: v6
{code}

So I improved the Metadata format to this:

{code:xml}
Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
lastModified=0;m6=v6;
{code}



[jira] [Updated] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-03 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1602:
--

Attachment: NUTCH-1602.patch

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> so I improve the Metadata format to this
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}



[jira] [Commented] (NUTCH-1600) Injector overwrite does not always work properly

2013-07-03 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699034#comment-13699034
 ] 

lufeng commented on NUTCH-1600:
---

Tests work fine. 
+1

> Injector overwrite does not always work properly
> 
>
> Key: NUTCH-1600
> URL: https://issues.apache.org/jira/browse/NUTCH-1600
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: NUTCH-1600-1.8.patch
>
>
> db.injector.update works as it should but db.injector.overwrite doesn't 
> always seem to properly overwrite the record. This issue exists for some time 
> and we've already fixed it in our dist of Nutch.
> This record just has been updated (interval).
> {code}
> Injector: starting at 2013-07-03 10:34:15
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seeds
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 9
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-07-03 10:34:21, elapsed: 00:00:05
> URL: url
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Jul 05 12:11:44 CEST 2013
> Modified time: Fri Jun 28 12:11:44 CEST 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> Score: 0.0
> Signature: ba29ef3e680323a6d0da74c156800e03
> Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
> {code}
> If we now overwrite the record, nothing happens. With this patch installed it 
> overwrites the record as it should and also logs update & overwrite switches 
> to console:
> {code}
> Injector: starting at 2013-07-03 10:36:30
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seeds
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 9
> Injector: Merging injected urls into crawl db.
> Injector: overwrite: true
> Injector: update: false
> Injector: finished at 2013-07-03 10:36:36, elapsed: 00:00:05
> URL: url
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jul 03 10:36:30 CEST 2013
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 14000 seconds (0 days)
> Score: 1.0
> Signature: null
> Metadata: fixedInterval: 14000.0
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1581) CrawlDB csv output to include metadata

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696865#comment-13696865
 ] 

lufeng commented on NUTCH-1581:
---

I have tested it with Nutch 1.x and it works fine.

+1

> CrawlDB csv output to include metadata
> --
>
> Key: NUTCH-1581
> URL: https://issues.apache.org/jira/browse/NUTCH-1581
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1581-1.8.patch
>
>
> Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696854#comment-13696854
 ] 

lufeng commented on NUTCH-1327:
---

Hi Markus, I tested your patch. Did you forget to add the deploy and test 
targets to src/plugin/build.xml?

+1 

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.
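
To illustrate the idea in the description above, here is a minimal sketch of sorting a query string so that URLs differing only in parameter order collapse to one form. This is not the code from the attached patch; the class name QuerySortSketch and the method sortQuery are hypothetical, and a real normalizer would plug into the Nutch URL normalizer extension point.

```java
import java.util.Arrays;

public class QuerySortSketch {

    // Hypothetical helper: returns the URL with its query parameters sorted
    // lexicographically. Any fragment ("#...") is kept unchanged at the end.
    static String sortQuery(String url) {
        String fragment = "";
        int hash = url.indexOf('#');
        if (hash >= 0) {
            fragment = url.substring(hash);
            url = url.substring(0, hash);
        }
        int q = url.indexOf('?');
        if (q < 0) {
            return url + fragment; // no query string, nothing to sort
        }
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params) + fragment;
    }

    public static void main(String[] args) {
        // Both variants normalize to the same string, so they are no longer
        // treated as two different URLs.
        System.out.println(sortQuery("http://example.com/p?b=2&a=1"));
        System.out.println(sortQuery("http://example.com/p?a=1&b=2"));
    }
}
```

Plain lexicographic sorting is only one possible choice; it does not decode or deduplicate parameters.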



[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696798#comment-13696798
 ] 

lufeng commented on NUTCH-1594:
---

Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis.

> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.2
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1594.patch
>
>
> In the ParseUtil class the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
> so even if you define the "db.max.outlinks.per.page" parameter, it will not 
> take effect.



[jira] [Assigned] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1594:
-

Assignee: lufeng

> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.2
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1594.patch
>
>
> In the ParseUtil class the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
> so even if you define the "db.max.outlinks.per.page" parameter, it will not 
> take effect.



[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1594:
--

Attachment: NUTCH-1594.patch

> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.2
>Reporter: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1594.patch
>
>
> In the ParseUtil class the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
> so even if you define the "db.max.outlinks.per.page" parameter, it will not 
> take effect.



[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1594:
--

Patch Info: Patch Available

> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.2
>Reporter: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1594.patch
>
>
> In the ParseUtil class the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
> so even if you define the "db.max.outlinks.per.page" parameter, it will not 
> take effect.



[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-06-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1594:
--

Description: 
In the ParseUtil class the count variable is never changed. The code looks like this:
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 

so even if you define the "db.max.outlinks.per.page" parameter, it will not 
take effect.

  was:
In the ParseUtil class the count variable is never changed. The code looks like this:
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 

Summary: count variable is never changed in ParseUtil class  (was: 
count variable is never in ParseUtil )

> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.2
>Reporter: lufeng
>Priority: Minor
> Fix For: 2.3
>
>
> In the ParseUtil class the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
> so even if you define the "db.max.outlinks.per.page" parameter, it will not 
> take effect.



[jira] [Created] (NUTCH-1594) count variable is never in ParseUtil

2013-06-29 Thread lufeng (JIRA)
lufeng created NUTCH-1594:
-

 Summary: count variable is never in ParseUtil 
 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Priority: Minor
 Fix For: 2.3


In the ParseUtil class the count variable is never changed. The code looks like this:
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
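
A minimal, self-contained sketch of the problem, assuming the loop body simply keeps each outlink. The class OutlinkLimitSketch and both method names are hypothetical; the real ParseUtil loop and the attached patch may differ, for example by counting only outlinks that pass validation.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {

    // Shape of the reported bug: count is tested in the loop condition but
    // never incremented, so the limit is ignored whenever maxOutlinks > 0.
    static List<String> collectBuggy(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            kept.add(outlinks[i]);
        }
        return kept;
    }

    // One possible fix: increment count for every outlink that is kept.
    static List<String> collectFixed(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            kept.add(outlinks[i]);
            count++;
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] outlinks = {"a", "b", "c", "d"};
        System.out.println(collectBuggy(outlinks, 2).size()); // prints 4, limit ignored
        System.out.println(collectFixed(outlinks, 2).size()); // prints 2, limit applied
    }
}
```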



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-18 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686830#comment-13686830
 ] 

lufeng commented on NUTCH-1527:
---

Thanks Markus, I tried the patch and can index the document successfully. +1 for 
commit.

> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 2.4
>
> Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, 
> NUTCH-1527.patch, NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be inline with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685661#comment-13685661
 ] 

lufeng commented on NUTCH-1527:
---

Hi Markus, I have already tested the newest patch on my laptop. Very cool. +1 
for commit.

{code:xml}
lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ 
bin/nutch index crawldb/ segmetns/20130617225826/
Indexer: starting at 2013-06-17 23:46:47
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.index : elastic index command 
elastic.max.bulk.docs : elastic bulk index doc counts. (default 500) 
elastic.max.bulk.size : elastic bulk index length. (default 5001001 
~5MB)


Processing remaining requests [docs = 1, length = 7528, total docs = 1]
Processing to finalize last execute
Previous took in ms 27, including wait 21
Indexer: finished at 2013-06-17 23:46:57, elapsed: 00:00:10
{code}

One question, though: should we add the elastic.cluster and elastic.index 
properties to the nutch-default.xml file?

> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 2.4
>
> Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, 
> NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be inline with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682380#comment-13682380
 ] 

lufeng commented on NUTCH-1527:
---

Hi Markus,

1. Elasticsearch loads its configuration file first, so you need to add 
config/elasticsearch.yml to your runtime/local/config. But I have not found any 
way to load that configuration file through the Nutch configuration.

2. Do you still have lucene-core-3.4.jar in your runtime/local/lib directory? 
Or did you add this

{code:xml}
+  
{code}

to the ivy/ivy.xml file?

Maybe Elasticsearch cannot load classes through the Nutch plugin system.


> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 2.4
>
> Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be inline with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Closed] (NUTCH-1575) support solr authentication in nutch 2.x

2013-06-03 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1575.
-


> support solr authentication in nutch 2.x
> 
>
> Key: NUTCH-1575
> URL: https://issues.apache.org/jira/browse/NUTCH-1575
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1575.patch
>
>
> Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Resolved] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1545.
---

Resolution: Fixed

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
>
> The concept of segment is replaced by batchId in 2.x
> I'm currently getting rid of segments references in 2.x
> This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1545:
--

Fix Version/s: (was: 2.3)
   2.2

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
>
> The concept of segment is replaced by batchId in 2.x
> I'm currently getting rid of segments references in 2.x
> This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670376#comment-13670376
 ] 

lufeng commented on NUTCH-1545:
---

Committed for Nutch 2.2 in revision 1487875 by Feng. Thanks Tejas and Lewis.

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
>
> The concept of segment is replaced by batchId in 2.x
> I'm currently getting rid of segments references in 2.x
> This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Resolved] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1575.
---

Resolution: Fixed

> support solr authentication in nutch 2.x
> 
>
> Key: NUTCH-1575
> URL: https://issues.apache.org/jira/browse/NUTCH-1575
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1575.patch
>
>
> Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Commented] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669351#comment-13669351
 ] 

lufeng commented on NUTCH-1575:
---

Committed for 2.2 in revision 1487521 by Feng. Thanks Lewis.

> support solr authentication in nutch 2.x
> 
>
> Key: NUTCH-1575
> URL: https://issues.apache.org/jira/browse/NUTCH-1575
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1575.patch
>
>
> Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Resolved] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1563.
---

Resolution: Fixed

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.
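
A rough sketch of the kind of wiring the description implies: the job asks the schedule which fields it needs and includes them in the set of fields to load. This is an assumption about the intent, not the attached patch. Apart from the method name getFields, every name here (ScheduleFieldsSketch, Field, Schedule, fieldsToLoad) is hypothetical and stands in for the real Nutch 2.x types.

```java
import java.util.EnumSet;
import java.util.Set;

public class ScheduleFieldsSketch {

    // Hypothetical stand-in for the WebPage fields a job can ask the store to load.
    enum Field { STATUS, FETCH_TIME, SCORE, MODIFIED_TIME }

    // Hypothetical stand-in for a fetch schedule that declares the fields it reads.
    interface Schedule {
        Set<Field> getFields();
    }

    // Union of the job's own fields and the fields the schedule declares.
    // If the schedule is never consulted, its extra fields are not requested.
    static Set<Field> fieldsToLoad(Set<Field> jobFields, Schedule schedule) {
        Set<Field> fields = EnumSet.noneOf(Field.class);
        fields.addAll(jobFields);
        fields.addAll(schedule.getFields());
        return fields;
    }

    public static void main(String[] args) {
        // A custom schedule that needs MODIFIED_TIME in addition to the job's fields.
        Schedule custom = () -> EnumSet.of(Field.MODIFIED_TIME);
        Set<Field> jobFields = EnumSet.of(Field.STATUS, Field.FETCH_TIME);
        System.out.println(fieldsToLoad(jobFields, custom));
    }
}
```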



[jira] [Closed] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1563.
-


> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667775#comment-13667775
 ] 

lufeng commented on NUTCH-1527:
---

Hi Luca, you can now click "Assign to me" and then attach your improvement patch. 
Thanks, Luca.

> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.4
>
> Attachments: NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be inline with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1527:
--

Assignee: (was: lufeng)

> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.4
>
> Attachments: NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be inline with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667766#comment-13667766
 ] 

lufeng commented on NUTCH-1527:
---

Hi Luca, sorry for my delayed reply. Yes, you can improve this patch following
your suggestion. Can I assign this issue to you? I am willing to test it.
Thanks, Luca.




-- 
Don't Grow Old, Grow Up... :-)


> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.4
>
> Attachments: NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be inline with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Updated] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-23 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1563:
--

Fix Version/s: (was: 2.3)
   2.2

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.



[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-23 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665161#comment-13665161
 ] 

lufeng commented on NUTCH-1563:
---

Hi Tejas,

Yes, I pushed this patch to 2.x:

https://svn.apache.org/repos/asf/nutch/branches/2.x

Do you mean that I pushed it to the wrong place?

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.



[jira] [Updated] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1575:
--

Attachment: NUTCH-1575.patch

add solr authentication

> support solr authentication in nutch 2.x
> 
>
> Key: NUTCH-1575
> URL: https://issues.apache.org/jira/browse/NUTCH-1575
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1575.patch
>
>
> Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Work started] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1575 started by lufeng.

> support solr authentication in nutch 2.x
> 
>
> Key: NUTCH-1575
> URL: https://issues.apache.org/jira/browse/NUTCH-1575
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
>
> Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Created] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)
lufeng created NUTCH-1575:
-

 Summary: support solr authentication in nutch 2.x
 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2


Support Solr authentication in Nutch 2.x, as in 1.x.



[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662057#comment-13662057
 ] 

lufeng commented on NUTCH-1545:
---

Hi Tejas,

Yes, the patch just moves the random batchId generator from the code into the 
crawl script. Users don't have to bother with this.

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
>
> The concept of segment is replaced by batchId in 2.x
> I'm currently getting rid of segments references in 2.x
> This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Updated] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-20 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1563:
--

Fix Version/s: (was: 2.3)
   2.2

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, they always return 
> null.
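
To make the problem concrete, the following is a minimal Java sketch of a job that merges the fields requested by a schedule into the set of fields it loads. All type names and field names here (ScheduleSketch, CustomScheduleSketch, GeneratorFieldsSketch, "fetchInterval" and so on) are hypothetical stand-ins and not the real Nutch classes; the sketch only shows why a field that is never requested would come back as null.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a schedule; not the real Nutch interface.
interface ScheduleSketch {
    // Fields the schedule needs to have loaded on each page.
    Collection<String> getFields();
}

// A user-defined schedule that needs two extra fields (illustrative names).
final class CustomScheduleSketch implements ScheduleSketch {
    @Override
    public Collection<String> getFields() {
        return Arrays.asList("fetchInterval", "modifiedTime");
    }
}

final class GeneratorFieldsSketch {
    // Fields the job itself always reads (illustrative names).
    private static final Set<String> JOB_FIELDS =
        new HashSet<>(Arrays.asList("fetchTime", "score", "status"));

    // Union of the job's own fields and the schedule's fields. If the
    // schedule's fields were left out, they would never be loaded.
    static Set<String> fieldsToLoad(ScheduleSketch schedule) {
        Set<String> fields = new HashSet<>(JOB_FIELDS);
        fields.addAll(schedule.getFields());
        return Collections.unmodifiableSet(fields);
    }

    public static void main(String[] args) {
        System.out.println(fieldsToLoad(new CustomScheduleSketch()));
    }
}
```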



[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662003#comment-13662003
 ] 

lufeng commented on NUTCH-1563:
---

Committed for 2.2 revision 1484482 by Feng. Thanks Canan and Lewis.

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, they always return 
> null.



[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-08 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1527:
--

Attachment: NUTCH-1527.patch

This patch ports the elasticsearch indexer plugin to Nutch trunk. Before you 
apply it, you need to apply the patch from 
https://issues.apache.org/jira/browse/NUTCH-1486 first. I use the newest 
version of elasticsearch, 0.90.0, which uses Lucene 4.2.1. The patch needs more 
testing, as I am a newbie to elasticsearch. Any comments on it are welcome.

thanks Lewis.

> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1527.patch
>
>
> The source repos for this can be found here [0].
> This issue should be in line with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1486) Upgrade to Solr 4.2.1

2013-05-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651966#comment-13651966
 ] 

lufeng commented on NUTCH-1486:
---

Also, the version of lucene-core and solr-solrj in plugin.xml in the 
indexer-solr directory is still 3.4.0.

> Upgrade to Solr 4.2.1
> -
>
> Key: NUTCH-1486
> URL: https://issues.apache.org/jira/browse/NUTCH-1486
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6, 2.1
> Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT & Probably 2.2-SNAPSHOT
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, 
> NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch
>
>
> When attempting to configure a 4 multicore 4.0 instance with Nutch 
> schema-solr4.xml file, I get the following exceptions.
> This has been discussed previously. As I see it we have two options
> 1. Keep maintaining both schema options
> 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml
> Thoughts?
> {code}
> SEVERE: Unable to create core: collection4
> org.apache.solr.common.SolrException: Unable to use updateLog: _version_ field 
> must exist in schema, using indexed="true" stored="true" and 
> multiValued="false" (_version_ does not exist)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:721)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:566)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
>   at 
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
>   at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
>   at 
> org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
>   at 
> org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
>   at 
> org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
>   at 
> org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
>   at 
> org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
>   at 
> org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
>   at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
>   at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
>   at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
>   at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
>   at 
> org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
>   at org.eclipse.jetty.server.Server.doStart(Server.java:263)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java

[jira] [Assigned] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-08 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1527:
-

Assignee: lufeng

> Port nutch-elasticsearch-indexer to Nutch
> -
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3, 1.8
>
>
> The source repos for this can be found here [0].
> This issue should be in line with the work already done by Julien and others 
> over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer



[jira] [Commented] (NUTCH-1486) Upgrade to Solr 4.2.1

2013-05-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651936#comment-13651936
 ] 

lufeng commented on NUTCH-1486:
---

Hi Lewis,
The dependency version of solr-solrj in pom.xml is still 3.1.0. Should we 
upgrade it to 4.2.1?

> Upgrade to Solr 4.2.1
> -
>
> Key: NUTCH-1486
> URL: https://issues.apache.org/jira/browse/NUTCH-1486
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6, 2.1
> Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT & Probably 2.2-SNAPSHOT
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, 
> NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch
>
>
> When attempting to configure a 4 multicore 4.0 instance with Nutch 
> schema-solr4.xml file, I get the following exceptions.
> This has been discussed previously. As I see it we have two options
> 1. Keep maintaining both schema options
> 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml
> Thoughts?
> {code}
> SEVERE: Unable to create core: collection4
> org.apache.solr.common.SolrException: Unable to use updateLog: _version_ field 
> must exist in schema, using indexed="true" stored="true" and 
> multiValued="false" (_version_ does not exist)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:721)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:566)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
>   at 
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
>   at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
>   at 
> org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
>   at 
> org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
>   at 
> org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
>   at 
> org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
>   at 
> org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
>   at 
> org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
>   at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
>   at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
>   at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
>   at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
>   at 
> org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
>   at 
> org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
>   at org.eclipse.jetty.server.Server.doStart(Server.java:263)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at 
> org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.jav

[jira] [Comment Edited] (NUTCH-1555) Move to commons-cli for command line parsing

2013-04-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641869#comment-13641869
 ] 

lufeng edited comment on NUTCH-1555 at 4/25/13 2:48 PM:


Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.

Should we move every tool to commons-cli? I find that there are 47 files 
that would need to be changed.

Sebastian:
1. Used eclipse-codeformat.xml to format the source code.

Thanks, Lewis and Sebastian.

  was (Author: amuseme.lu):
Lewis:
1. fixed the fetch NPE bug
2. fixed the update not work bug

Should we put every tools to use commons-cli? I find that there are 47 files 
need to be moved.

[~wastl-nagel]
1. use eclipse-codeformat.xml to format the source code

Thanks Lewis and Sebastian.
  
> Move to commons-cli for command line parsing 
> -
>
> Key: NUTCH-1555
> URL: https://issues.apache.org/jira/browse/NUTCH-1555
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
> Fix For: 2.2
>
> Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch
>
>
> I just accidentally passed in the following argument to parser job
> {code}
> law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
> ParserJob: starting
> ParserJob: resuming:  false
> ParserJob: forced reparse:false
> ParserJob: batchId:   updatedb
> ParserJob: success
> {code}
> This is a bug for sure
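
The behaviour reported above, where a stray word is silently accepted as a batchId, can be contrasted with strict option parsing. The sketch below is plain Java written for illustration; it is not the commons-cli API and not the code in the attached patches, and the "-batchId" option name and the StrictArgsSketch class are assumptions.

```java
// Hypothetical strict parser for illustration; not Nutch code.
final class StrictArgsSketch {

    // Returns the value given to "-batchId" and rejects every other token,
    // so a bare word such as "updatedb" is reported instead of being used.
    static String parseBatchId(String[] args) {
        String batchId = null;
        for (int i = 0; i < args.length; i++) {
            if ("-batchId".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Option -batchId requires a value");
                }
                batchId = args[++i];
            } else {
                throw new IllegalArgumentException("Unrecognized argument: " + args[i]);
            }
        }
        if (batchId == null) {
            throw new IllegalArgumentException("Missing required option: -batchId <id>");
        }
        return batchId;
    }

    public static void main(String[] args) {
        try {
            System.out.println("batchId: " + parseBatchId(args));
        } catch (IllegalArgumentException e) {
            // Fail loudly instead of reporting success on bad input.
            System.err.println(e.getMessage());
        }
    }
}
```

An options library gives the same kind of rejection of unknown arguments without each tool having to hand-roll the loop, which is presumably the motivation for the move.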



[jira] [Updated] (NUTCH-1555) Move to commons-cli for command line parsing

2013-04-25 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1555:
--

Attachment: NUTCH-1555-v1.patch

Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.

Should we move every tool to commons-cli? I find that there are 47 files 
that would need to be changed.

[~wastl-nagel]
1. Used eclipse-codeformat.xml to format the source code.

Thanks, Lewis and Sebastian.

> Move to commons-cli for command line parsing 
> -
>
> Key: NUTCH-1555
> URL: https://issues.apache.org/jira/browse/NUTCH-1555
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
> Fix For: 2.2
>
> Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch
>
>
> I just accidentally passed in the following argument to parser job
> {code}
> law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
> ParserJob: starting
> ParserJob: resuming:  false
> ParserJob: forced reparse:false
> ParserJob: batchId:   updatedb
> ParserJob: success
> {code}
> This is a bug for sure


