Tomcat adds file:/// to searcher.dir path

2010-12-23 Thread alxsss
Hello,

I have installed nutch-1.2 on Fedora 14 with Tomcat 6. I added the path to the crawl
dir in the searcher.dir property in WEB-INF/classes/nutch-default.xml
as /home/user/nutch-1.2/crawl
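
For reference, this is roughly how the property looks (a sketch -- searcher.dir is
the standard Nutch 1.2 property; overriding it in nutch-site.xml rather than editing
nutch-default.xml is usually recommended):

<property>
  <name>searcher.dir</name>
  <value>/home/user/nutch-1.2/crawl</value>
  <description>Path to the crawl directory holding the index/indexes and segments.</description>
</property>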

I see in the catalina.out file:
 WARN  SearchBean - Neither file:///home/user/nutch-1.2/crawl/index nor 
file:///home/home/nutch-1.2/crawl/indexes found!

I think the problem is that Tomcat adds file:// to the searcher.dir path, because
both folders are there and their permissions are 777.
Any ideas how to fix this issue?

Thanks.
A.


failed with: java.net.UnknownHostException

2010-12-27 Thread alxsss
Hello

I use nutch-1.2 on Fedora 14 and am trying to index about 4000 domains. I use
bin/nutch crawl urls -dir crawl -depth 3 -topN -1 and have this in
crawl-urlfilter.txt:
# accept hosts in MY.DOMAIN.NAME
 +^http://([a-z0-9]*\.)* 

I noticed that if a domain is entered like http://mydomain.com in the seed
file, nutch gives the error
failed with: java.net.UnknownHostException for some domains.

If, however, I enter the same domain with www, like http://www.mydomain.com,
nutch does not give any errors.

Since entering http://mydomain.com in the browser redirects to
http://www.mydomain.com,
I thought this might be a bug in nutch.
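
(To rule out plain DNS problems -- some apex domains have no A record while their
www host does -- here is a rough check over the seed list, using the standard
host tool; the seed file name is only an example:)

while read url; do
  h=$(echo "$url" | sed -e 's|^https\?://||' -e 's|/.*||')
  host "$h" > /dev/null 2>&1 || echo "no DNS record: $h"
done < urls/seed.txt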

Any thoughts how to fix this issue?

Thanks.
Alex.


unnecessary results in search

2011-01-03 Thread alxsss
Hello,

I used nutch-1.2 to index a few domains. I noticed that nutch correctly crawled
all sub-pages of the domains. By sub-pages I mean the following: for example, for a
domain mydomain.com, all links inside it like
mydomain.com/show/photos/1, etc. I also noticed in our apache logs that
google-bot also crawled all sub-pages.
However, in a search for mydomain.com, google gives mydomain.com on the first page
and almost no subpages, but nutch gives all subpages. If a domain has, let's say,
200 sub-pages and we display 10 results per page, then it would take us 20 pages
of paging forward before we see results from other domains. In contrast, google
displays results from other domains in second place.

Is there a way of fixing this issue?

Thanks in advance.
Alex.




Re: unnecessary results in search

2011-01-04 Thread alxsss
Hello,

Thank you for your response.

Let me give you more detail on the issue that I have.
First, definitions. Let's say I have my own domain that I host on a dedicated
server, and call it mydomain.com.
Next, call subdomains the following: answers.mydomain.com, mail.mydomain.com,
maps.mydomain.com, etc.
Call subpages the following: mydomain.com/show/photos/1,
mydomain.com/forum/id/5, etc.

Having these definitions, I have observed by examining apache log files that
the Google and Nutch crawlers crawled all subpages of mydomain.com.
However, if we search in google for the keyword mydomain.com, it gives in its
results all subdomains of mydomain.com, not all subpages, or maybe only some of
them. If we search in Nutch for the keyword mydomain.com, it gives all subdomains
and subpages. My concern is not to include all subpages in a search for the keyword
mydomain.com. Of course, we must see subpages for keywords that are in those
subpages. This means we must not remove subpages from the index.

I hope this gives you more detail of the issue that I have.

Thanks.
Alex.




-Original Message-
From: Gora Mohanty g...@mimirtech.com
To: user user@nutch.apache.org
Sent: Tue, Jan 4, 2011 3:28 am
Subject: Re: unnecessary results in search


On Tue, Jan 4, 2011 at 5:40 AM,  alx...@aim.com wrote:

 Hello,

 I used nutch-1.2 to index a few domains. I noticed that nutch correctly
 crawled all sub-pages of the domains. By sub-pages I mean the following, for
 example for a domain mydomain.com all links inside it like
 mydomain.com/show/photos/1, etc. I also noticed in our apache logs that
 google-bot also crawled all sub-pages.
 However, in a search for mydomain.com google gives mydomain.com on the first
 page and almost no subpages, but nutch gives all subpages. If a domain has,
 let's say, 200 sub-pages and we display 10 results per page then it would take
 us 20 pages before we see results from other domains. In contrast, google
 displays results from other domains in second place.
[...]

It is not entirely clear what you want:

* If your goal is to only crawl to a certain depth on a domain, you can
  use the -depth argument for the Nutch crawl, or use the -topN option
  to specify the max. number of pages to retrieve.

* Can you give an actual example of what you are searching for?
  It is difficult to understand your description above. E.g., searching
  Google for yahoo.com returns many, many links from yahoo.com.

* If you mean that a search with any query string returns different
  results between Google and Nutch, that could be due to many
  reasons. In both cases, the returned pages are ranked by relevancy,
  but the algorithm is different. Also, Google has probably indexed many
  more sites than your Nutch crawl.

Regards,
Gora




 


Re: Exception on segment merging

2011-01-04 Thread alxsss
Which command did you use? Merging segments is very expensive in resources, so 
I try to avoid merging them. 

-Original Message-
From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com
To: user user@nutch.apache.org
Sent: Tue, Jan 4, 2011 7:12 am
Subject: FW: Exception on segment merging


I looked in the hadoop log and some more details about the exception are there.

Please help me figure out what to check for this error.



Here are the details:



2011-01-04 07:40:23,999 INFO  segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:36,563 INFO  segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:36,563 INFO  segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:43,685 INFO  segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:43,686 INFO  segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:47,316 WARN  mapred.LocalJobRunner - job_local_0001
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1044)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.writeString(Text.java:412)
        at org.apache.nutch.metadata.Metadata.write(Metadata.java:220)
        at org.apache.nutch.protocol.Content.write(Content.java:170)
        at org.apache.hadoop.io.GenericWritable.write(GenericWritable.java:135)
        at org.apache.nutch.metadata.MetaWrapper.write(MetaWrapper.java:107)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:900)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at org.apache.nutch.segment.SegmentMerger.map(SegmentMerger.java:361)
        at org.apache.nutch.segment.SegmentMerger.map(SegmentMerger.java:113)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_32_0/output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)





-Original Message-

From: Marseld Dedgjonaj [mailto:marseld.dedgjo...@ikubinfo.com] 

Sent: Tuesday, January 04, 2011 1:28 PM

To: user@nutch.apache.org

Subject: Exception on segment merging



Hello everybody,

I have configured nutch-1.2 to crawl all urls of a specific website.

It runs fine for a while, but now that the number of indexed urls has grown to
more than 30'000, I got an exception on segment merging.

Has anybody seen this kind of error?

The exception is shown below.

Slice size: 5 URLs.
Slice size: 5 URLs.
Slice size: 5 URLs.
Slice size: 5 URLs.
Slice size: 5 URLs.
Exception in thread main java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:638)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:683)

Merge Segments - End at: 04-01-2011 07:40:48



 



Thanks in advance  Best Regards,



Marseldi




Re: unnecessary results in search

2011-01-06 Thread alxsss
One more thing I just noticed is that Nutch search results do not display
information from the meta tag.
Google and Yahoo do.
In more detail, Nutch search results for the keyword mydomain.com display some
short text from the page mydomain.com. In contrast, google and yahoo search
results for the same keyword display words from the meta tag.

How can this be fixed in Nutch?

Thanks.
Alex.


-Original Message-
From: Gora Mohanty g...@mimirtech.com
To: user user@nutch.apache.org
Sent: Wed, Jan 5, 2011 10:20 am
Subject: Re: unnecessary results in search


On Wed, Jan 5, 2011 at 11:25 PM,  alx...@aim.com wrote:

 I do search directly in Nutch version 1-2.
 I think google gives very low scores to subpages of a domain and higher
 scores to other domains for a given keyword.

That is possible, though I am not sure why the situation is different with
non-popular domains.

 This must be so because if mydomain.com has let say 2000 subpages then in
 the search result for keyword mydomain.com the next 200 pages all will be
 subpages of mydomain.com.

 If someone could direct me to the part of the source code where Nutch gives
 scores to pages I can take a look to it.

If you are using Nutch for search also, I am afraid that someone else
will have to help you. I have no experience there.

Regards,
Gora




 


Re: unnecessary results in search

2011-01-10 Thread alxsss
Hello,

Just noticed that google actually has results from all subpages of mydomain.com
for the keyword mydomain.com, but they are hidden behind a "show more results from
mydomain.com" link. Is there a way of putting more results from the same domain
behind such a link in the Nutch rss feed, since I use opensearch to display
results from nutch?

Thanks.
Alex.



-Original Message-
From: Gora Mohanty g...@mimirtech.com
To: user user@nutch.apache.org
Sent: Wed, Jan 5, 2011 10:20 am
Subject: Re: unnecessary results in search


On Wed, Jan 5, 2011 at 11:25 PM,  alx...@aim.com wrote:

 I do search directly in Nutch version 1-2.
 I think google gives very low scores to subpages of a domain and higher
 scores to other domains for a given keyword.

That is possible, though I am not sure why the situation is different with
non-popular domains.

 This must be so because if mydomain.com has let say 2000 subpages then in
 the search result for keyword mydomain.com the next 200 pages all will be
 subpages of mydomain.com.

 If someone could direct me to the part of the source code where Nutch gives
 scores to pages I can take a look to it.

If you are using Nutch for search also, I am afraid that someone else
will have to help you. I have no experience there.

Regards,
Gora




 


Re: Few questions from a newbie

2011-01-26 Thread alxsss
You can set the fetching of external and internal links to false and increase the depth.
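
A minimal sketch of the corresponding nutch-site.xml override (db.ignore.external.links
is the standard property for skipping outlinks that point to other hosts; whether to
touch internal links as well depends on the use case):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Outlinks leading to hosts other than the seed host are ignored.</description>
</property>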

-Original Message-
From: Churchill Nanje Mambe mambena...@afrovisiongroup.com
To: user user@nutch.apache.org
Sent: Wed, Jan 26, 2011 8:03 am
Subject: Re: Few questions from a newbie


even if the url being crawled is shortened, it will still lead nutch to the

actual link and nutch will fetch it




 


nutch crawl command takes 98% of cpu

2011-01-27 Thread alxsss
Hello,

I run the crawl command with -depth 7 -topN -1 on my linux box with a 1.5 Mbps
internet connection, an AMD 3.1 GHz processor, 4GB memory, Fedora Linux 14, and
nutch 1.2. After 1-2 days nutch takes 98% of the cpu. My seed file includes about
3500 domains, and I set fetching of external links to false.

Is this normal? If not, what can be done to improve it?

Thanks.
Alex.


Re: Nutch search result

2011-02-18 Thread alxsss

 2nd, after testing to fetch several pages from wikipedia, the search
 query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
 ../wiki_dir returns

It returns a result for the keyword apache because that url has apache in it.

 crawling with crawl command (bin/nutch crawl urls -dir crawl -depth 3
 -topN 50), it actually fetches some pages e.g. `fetching
 http://www.plurk.com/t/Brazil'). I am confused the differences between
 using crawl command and step-by-step crawling.

In order to get the same fetching in the step-by-step approach you need to run the
generate/fetch/updatedb cycle 3 times, because you have depth 3 in the crawl command.
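
Roughly, that loop looks like this (a sketch; the paths and the -topN value are
just examples):

# one generate/fetch/updatedb round per depth level
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 50
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done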


 

-Original Message-
From: Thomas Anderson t.dt.aander...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 18, 2011 9:10 pm
Subject: Re: Nutch search result


The version used is nutch 1.1. OS is debian testing. Java version is 1.6.0_23.



The first question raises from when testing to fetch plurk.com. The

url specified at the inject stage only contains e.g. http://plurk.com.

After going through the steps described in the tutorial, I notice no

`fetching http:// ... ' key words were displayed on console. But when

crawling with crawl command (bin/nutch crawl urls -dir crawl -depth 3

-topN 50), it actually fetches some pages e.g. `fetching

http://www.plurk.com/t/Brazil'). I am confused the differences between

using crawl command and step-by-step crawling.



When fetching wikipedia, the url specified is http://en.wikipedia.org.

No ibm related url exists. But the file containing wiki url is resided

under wiki folder where also stores crawldb, segments, etc.



Thanks for help.



On Fri, Feb 18, 2011 at 7:27 PM, McGibbney, Lewis John

lewis.mcgibb...@gcu.ac.uk wrote:

 Hi Thomas



 Firstly which dist are you using?



 ___

 From: Thomas Anderson [t.dt.aander...@gmail.com]

 Sent: 18 February 2011 10:11

 To: user@nutch.apache.org

 Subject: Nutch search result



 I follow the NutchTutorial and get the search worked, but I have

 several questions.



 1st, is it possible for a website to setup some restriction so that

 nutch can not fetch its pages or the pages fetched is limited under

 some condition? If so, what file (e.g. robots.txt?) nutch would

 respect in order to avoid fetching specific pages?



 For this can you please specify your use scenario. If You hve a website, with 

certain areas, which you wish not to be crawled then I would assume a robots 

file would suffice. Inversely, if you wish to restrict Nutch from crawling certain 

pages of specific domains then I imagine you would be looking at a different 

config of crawl-urlfilter





 2nd, after testing to fetch several pages from wikipedia, the search

 query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache

 ../wiki_dir returns



Total hits: 1

 0 20110218171640/http://en.wikipedia.org/wiki/IBM

IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the

 free encyclopedia Jump to:  ...



 I'm afraid that I completely lose you here. Have you specified some IBM page 

within your /wiki_dir ? If so, it might be the case that Nutch has not fetched 

pages for a certain reason E.g. politeness rules. Can anyone advise on this 

please?







 This seeming does not relate to apache, any reason that may explain

 the reason it returns IBM? Or any execution step below may go wrong?



bin/nutch inject ../wiki/crawldb urls



bin/nutch generate ../wiki/crawldb ../wiki/segments

bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`

bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`



bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100

bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`

bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`



bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100

bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`

bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`



bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments

bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb

 ../wiki/segments/*



 In addition, why only the third round 'generate, fetch, and updatedb'

 will actually fetch pages while the second round only replies it is

 done?



 The second round message



Fetcher: Your 'http.agent.name' value should be listed first in

 'http.robots.agents' property.

Fetcher: starting

Fetcher: segment: ../wiki/segments/20110218171338

Fetcher: threads: 10

QueueFeeder finished: total 1 records + hit by time limit :0

-finishing thread FetcherThread, activeThreads=1

fetching http://en.wikipedia.org/wiki/Main_Page

-finishing thread FetcherThread, activeThreads=1

-finishing thread FetcherThread, activeThreads=1

-finishing thread FetcherThread, activeThreads=1

-finishing thread FetcherThread, 

Re: Starting web frontend

2011-02-24 Thread alxsss

 

 Hello,

I wondered if there is a way of adding to a Solr index made from Nutch segments
another Solr index also made from Nutch segments.
I have to index about 3000 domains, but 5 of them are newspaper sites. So I
need to crawl-fetch-parse these 5 domains (with depth 2) and update the index
every day or so. The rest is crawled and indexed once a month.
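
What I have in mind for the daily run is roughly the following, with a separate
crawldb for the 5 sites (a sketch; the paths, the -topN value and the Solr URL
are only examples):

bin/nutch inject crawl-news/crawldb urls-news
# run the generate/fetch/updatedb block twice for depth 2
bin/nutch generate crawl-news/crawldb crawl-news/segments -topN 1000
s=`ls -d crawl-news/segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch parse $s        # only needed if fetcher.parse is false
bin/nutch updatedb crawl-news/crawldb $s
bin/nutch invertlinks crawl-news/linkdb -dir crawl-news/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl-news/crawldb crawl-news/linkdb crawl-news/segments/*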

Thanks.
Alex.


 

 

-Original Message-
From: Markus Jelsma markus.jel...@openindex.io
To: Jeremy Arnold jer...@possiblyfaulty.com
Cc: user user@nutch.apache.org
Sent: Thu, Feb 24, 2011 3:46 pm
Subject: Re: Starting web frontend


 Thanks for the reply Mark.

 

 So this means Nutch is really only going to be used for crawling now?

 Are there any plans for a JSON/XML RPC interface to using Nutch like

 Solr supports?



Yes, Nutch is going to focus to the fetch and parse jobs. Andrzej was working 

on a REST interface to control these jobs. This is part of 2.0.



 

 I am interested in a tight app integration where I can easily start

 crawls of new sites, and add/remove things from the index quickly. I

 guess I can rely directly on Solr for adding/removing from the index

 as well, or would you recommend this going through nutch?



Removing items from the index can be forced from Solr and Nutch. Solr provides 

easy methods to remove documents or documents that are the result of some 

query. Nutch can deduplicate (1.2+ and 2.0) and possibly remove 404 pages (1.3 

and 2.0) but the latter is not committed.



 

 

 Thanks,

 Jeremy

 

 On Thu, Feb 24, 2011 at 12:23 PM, Markus Jelsma

 

 markus.jel...@openindex.io wrote:

  Hi Jeremy,

  

  Nutch' own search server is in the process of being deprecated, Nutch 1.2

  was the last release to provide the search server. Please consider using

  Apache Solr as your search server.

  

  Cheers,

  

  I recently installed Nutch and have spent some time trying to get it

  working with limited success.

  

  ./nutch crawl urls -dir crawl -depth 5 -topN 50

  

  After the crawl completes I am trying to run the web frontend with the

  following command:

  

  ./nutch server 8080 crawl

  

  The server seems to be running (no output on the command line), but

  when I hit localhost:8080 I get a Error 324 (net::ERR_EMPTY_RESPONSE):

  Unknown error. Any ideas on how to get past this?

  

  I've been using  this tutorial to get started.

  http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

  

  

  Thanks,

  Jeremy




 


Re: Reload index without restart tomcat.

2011-03-08 Thread alxsss
That tutorial is applicable for the new version too.

-Original Message-
From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com
To: user user@nutch.apache.org; 'McGibbney, Lewis John' 
lewis.mcgibb...@gcu.ac.uk
Sent: Tue, Mar 8, 2011 5:25 am
Subject: RE: Reload index without restart tomcat.


Hi Lewis,

Thanks for your help. I tried to find any tutorial to integrate Nutch-1.2 with 

Solr but all what I found are for old versions of nutch(nutch-1.0). 

Please if you have any tutorial for nutch 1.2, send it to me.



Regards,

Marseldi



-Original Message-

From: McGibbney, Lewis John [mailto:lewis.mcgibb...@gcu.ac.uk] 

Sent: Monday, March 07, 2011 6:55 PM

To: user@nutch.apache.org

Subject: RE: Reload index without restart tomcat.



Hi Marseld,



You need to configure it and can be done in a number of ways (assuming you are 

using Nutch-1.2)



1) Individual commands when attempting a whole web crawl, the solrindex option 

should be used to pass indexing to Solr

2) pass -solr http://blahblahblah as a parameter when using crawl command



Obviously there are a number of issues such as a suitable Solr schema for field 

matching but you should be able to find most info on this by combining posts 

from both Nutch and Solr wiki respectively.



Hope this helps



Lewis



From: Marseld Dedgjonaj [marseld.dedgjo...@ikubinfo.com]

Sent: 07 March 2011 17:53

To: user@nutch.apache.org

Subject: Reload index without restart tomcat.



Hello Everybody,



I am trying to reload the index without restart of the tomcat.



I see in other topics that this is not possible in nutch without solr.



I am using nutch-1.2. Is solr included as indexer by default or should I

configure it?







Regards,



Marseld




will nutch-2 be able to index image files

2011-03-08 Thread alxsss
Hello,

I wondered if nutch version 2 will be able to index image files?

Thanks.
Alex.


Re: will nutch-2 be able to index image files

2011-03-08 Thread alxsss
I meant to extract the image title, src link and alt from img tags, and not to
store the image files. For a keyword search it must display the link, which
automatically displays the image itself in the search page.
I am not sure what you mean by image content-based retrieval. Do image files
have tags like mp3 ones?
Must a parse plugin be written in both cases?

Thanks.
Alex.


 

 

-Original Message-
From: Andrzej Bialecki a...@getopt.org
To: user user@nutch.apache.org
Sent: Tue, Mar 8, 2011 12:58 pm
Subject: Re: will nutch-2 be able to index image files


On 3/8/11 9:09 PM, alx...@aim.com wrote:

 Hello,



 I wondered if nutch version 2 be able to index image files?



In what way? Extract metadata and index image metadata as text? Sure, if 

we implement a plugin for it. Tika already supports EXIF, so this 

shouldn't be complicated, perhaps it's a tweak to the parse-tika 

configuration. Or did you mean the image content-based retrieval (e.g. 

using wavelets)?



-- 

Best regards,

Andrzej Bialecki 

  ___. ___ ___ ___ _ _   __

[__ || __|__/|__||\/|  Information Retrieval, Semantic Web

___|||__||  \|  ||  |  Embedded Unix, System Integration

http://www.sigram.com  Contact: info at sigram dot com






 


Re: nutch crawl command takes 98% of cpu

2011-03-14 Thread alxsss
Hello,

Which version is this patch applicable to?

Thanks.
Alex.


-Original Message-
From: Alexis alexis.detregl...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Feb 8, 2011 9:59 am
Subject: Re: nutch crawl command takes 98% of cpu


Hi,



Thanks for all the feedback. It looks like there is not much you can

do if you give the FLV parser some corrupted data. From a practical

point of view, we can say that this is extremely annoying as it takes

up all the CPU resources and prevent other threads to perform their

task properly, till the TIMEOUT occurs, kills the thread and frees up

the CPU.



We can notice that this happens when an FLV file is truncated (due to

an http.content.limit property lower that its content-length, in

bytes). So the suggestion is to hint to the parser that it is likely

to get stuck and skip the parsing in case the downloaded content size

mismatches the content-length header.



Besides, I often see errors in the HTML parser when the content is

truncated (https://issues.apache.org/jira/browse/TIKA-307). So it does

not hurt saving time and avoiding errors.



I created the issue here: https://issues.apache.org/jira/browse/NUTCH-965

See attached patch.



Alexis.



On Mon, Feb 7, 2011 at 12:00 PM, Ken Krugler

kkrugler_li...@transpac.com wrote:

 Hi Kirby  others,



 On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote:



 On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler

 kkrugler_li...@transpac.com wrote:



 Some comments below.



 On Jan 29, 2011, at 5:55am, Julien Nioche wrote:



 Hi,



 This shows the state of the various threads within a Java process. Most

 of

 them seem to be busy parsing zip archives with Tika. The interesting

 part

 is

 that the main thread is at the Generation step :



 *  at org.apache.nutch.crawl.Generator.generate(Generator.java:431)

  at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)

 *

 with the Thread-415331 normalizing the URLs as part of the generation.



 So why do we see threads busy at parsing these archives? I think this is

 a

 result of the Timeout mechanism (

 https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.

 Before it, we used to have the parsing step loop on a single document

 and

 never complete. Thanks to Andrzej's patch, the parsing is done is

 separate

 threads which are abandonned if more than X seconds have passed (default

 30

 I think). Obiously these threads are still lurking around in the

 background

 and consuming CPU.



 This is an issue when calling the Crawl command only. When using the

 separate commands for the various steps, the runaway threads die with

 the

 main process, however since the Crawl uses a single process, these

 timeout

 threads keep going.



 Am not an expert in multithreading and don't have an idea of whether

 these

 threads could be killed somehow. Andrzej, any clue?



 This is a fundamental problem with run-away threads - there is no safe,

 reliable way to kill them off.



 And if you parse enough documents, you will run into a number that

 currently

 cause Tika to hang. Zip files for sure, but we ran into the same issue

 with

 FLV files.



 Over in Tika-land, Jukka has a patch that fires up a child JVM and runs

 parsers there. See https://issues.apache.org/jira/browse/TIKA-416



 -- Ken





 All,



  Just an observation, but the general approach to this problem is to

 use Thread.interrupt().  Virtually all code in the JDK treats the

 thread being interrupted as a request to cancel.  Java Concurrency in

 Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,

 any general purpose library code that swallows InterruptedException

 and isn't implementing the Thread cancellation policy has a bug in it

 (the cancellation policy can only be implemented by the owner of the

 thread, unless the library is a task/thread library it cannot be

 implementing the cancellation policy).  Any place you see:



 [snip]



 One exception is that

 sockets read/write operations don't operate this way, the socket must

 be closed to interrupt a read/write, the approach JCIP suggests is to

 tie the socket and thread in such a way that interrupt() closes the

 sockets that would be reading/writing inside that thread.



 Excellent input, as I need to solve some issues with needing to abort HTTP

 requests.



 [snip]



 Not sure exactly what the problems inside of Tika are, but getting it

 to respect interruption would be a wonderful thing for everybody that

 uses it.  The problem might be getting all underlying libraries it

 uses to do so.



 Yes, that's exactly the issue in the cases I've seen. The libraries used to

 do the actual parsing can get caught in loops, when processing unexpected

 data. There's no checks for interrupt, e.g. it's code that is walking some

 data structure, and doesn't realize that it's in a loop (e.g. offset to next

 chunk is set to zero, so the same chunk is endlessly 

skip Urls regex

2011-03-17 Thread alxsss
Hello

I see in the nutch-1.2/conf/regex-urlfilter.txt file the following lines:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

However, nutch fetches urls like
http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/

Thanks.
Alex.


Re: Problem with Gora dependencies in trunk

2011-03-17 Thread alxsss
Hi,

If you download gora and build it with ant, you get rid of one of the
dependencies

--unresolved dependency: org.apache.gora#gora-core;0.1: not found

if you change the gora version from 1.0 to 1.0-incubator in one of the ivy files,
but this one

--unresolved dependency: org.apache.gora#gora-sql;0.1: not found

stays because gora itself does not build successfully.
It also has some other dependencies that I have not been able to locate yet.


Good luck,

Alex.






-Original Message-
From: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk
To: user user@nutch.apache.org
Sent: Thu, Mar 17, 2011 3:12 pm
Subject: Problem with Gora dependencies in trunk


Hi list,



OK I have seen quite a few threads on this topic as well as a couple of 
comments 

appended to the blog entries provided on the wiki. I also posted on this a 
while 

back but unfortunately got no reply so thought best thing to do was persist and 

see if I could solve the issue... how wrong I was. I have followed in minute 

detail the building nutch 2.0 in eclipse blog entry



I'm getting the following after attempting to add Ivy library to ivy/ivy.xml

Impossible to resolve dependencies of 
org.apache.nutch#${ant.project.name};working@lewis-01

  unresolved dependency: org.apache.gora#gora-core;0.1: not found

  unresolved dependency: org.apache.gora#gora-sql;0.1: not found

  unresolved dependency: org.apache.gora#gora-core;0.1: not found

  unresolved dependency: org.apache.gora#gora-sql;0.1: not found

  unresolved dependency: org.apache.gora#gora-core;0.1: not found

  unresolved dependency: org.apache.gora#gora-sql;0.1: not found



I read here http://www.mail-archive.com/user@nutch.apache.org/msg01515.html 
that 

there WAS a problem with Nutch wrongly assuming Gora artifacts, but that it has 

since been resolve so I am really stumped.

Any comments would be appreciated.

Thank you Lewis




Re: Problem with Gora dependencies in trunk

2011-03-17 Thread alxsss
Hi,

Did you build gora with ant? I checked it out from svn a few days ago, and
running ant for gora gives this error:

::
[ivy:resolve]   ::  UNRESOLVED DEPENDENCIES ::
[ivy:resolve]   ::
[ivy:resolve]   :: com.sun.jersey#jersey-core;1.4: not found
[ivy:resolve]   :: com.sun.jersey#jersey-json;1.4: not found
[ivy:resolve]   :: com.sun.jersey#jersey-server;1.4: not found


Thanks.
Alex.


-Original Message-
From: Markus Jelsma markus.jel...@openindex.io
To: user user@nutch.apache.org
Cc: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk
Sent: Thu, Mar 17, 2011 3:21 pm
Subject: Re: Problem with Gora dependencies in trunk


I had issues as well some while ago but i updated to the latest trunk revision 

a few weeks ago. I first built Gora's checkout and after that ant was doing 

well with Nutch. No need to change Ivy anymore.





 Hi list,

 

 OK I have seen quite a few threads on this topic as well as a couple of

 comments appended to the blog entries provided on the wiki. I also posted

 on this a while back but unfortunately got no reply so thought best thing

 to do was persist and see if I could solve the issue... how wrong I was. I

 have followed in minute detail the building nutch 2.0 in eclipse blog

 entry

 

 I'm getting the following after attempting to add Ivy library to

 ivy/ivy.xml Impossible to resolve dependencies of

 org.apache.nutch#${ant.project.name};working@lewis-01 unresolved

 dependency: org.apache.gora#gora-core;0.1: not found

   unresolved dependency: org.apache.gora#gora-sql;0.1: not found

   unresolved dependency: org.apache.gora#gora-core;0.1: not found

   unresolved dependency: org.apache.gora#gora-sql;0.1: not found

   unresolved dependency: org.apache.gora#gora-core;0.1: not found

   unresolved dependency: org.apache.gora#gora-sql;0.1: not found

 

 I read here http://www.mail-archive.com/user@nutch.apache.org/msg01515.html

 that there WAS a problem with Nutch wrongly assuming Gora artifacts, but

 that it has since been resolve so I am really stumped. Any comments would

 be appreciated.

 Thank you Lewis


Re: Script failing when arriving at 'Solr' commands

2011-04-07 Thread alxsss
It seems to me that you may have the same problem as before with the disk 
space. This may happen because you do mergesegs. Try not to merge segments.

Alex.


-Original Message-
From: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk
To: user user@nutch.apache.org
Sent: Wed, Apr 6, 2011 12:55 pm
Subject: Script failing when arriving at 'Solr' commands


Hi list,

The last week has been a real hang up and I have made very little progress so 
excuse this lengthy post. Using branch-1.3. My script contains following 
commands

1.inject
2.generate
   fetch
   parse
   updatedb
3.mergesegs
4.inverlinks
5.solrindex
6.solrdedup
7.solrclean
8.load new index

The script is running fine until solrindex stage and this output

LinkDb: starting at 2011-04-06 20:25:40
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: 
file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/segments/20110406202533
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2011-04-06 20:25:44, elapsed: 00:00:03
- SolrIndex (Step 5 of 8) -
SolrIndexer: starting at 2011-04-06 20:25:45
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: 
file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: 
file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_data
Input path does not exist: 
file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_text
Input path does not exist: 
file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/NEWindexes/current
- SolrDedup (Step 6 of 8) -
Usage: SolrDeleteDuplicates solr url
- SolrClean (Step 7 of 8) -
SolrClean: starting at 2011-04-06 20:25:47
Exception in thread main java.io.IOException: No FileSystem for scheme: http
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.indexer.solr.SolrClean.delete(SolrClean.java:168)
at org.apache.nutch.indexer.solr.SolrClean.run(SolrClean.java:180)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.solr.SolrClean.main(SolrClean.java:186)

Having inspected the linkdb I can see a directory named 'current', which in 
turn 
contains a 'part-0' directory which contains two files named 'data' and 
'index'... as far as I am aware this is identical when I used Nutch-1.2. A 
couple of points to note about my recent discovery's and thought's

1. I was having problems with script with a similar 'Input path does not exist' 
error until I added the hadoop.tmp.dir property as a HDD partition to 
nutch-site, this seemed to solve the problem.
2. I am aware that it is maybe not necessary (and possibly not best practice 
for 
some situations) to include an invertlinks command prior to indexing, however 
this has always been my practice and has always provided great results when I 
was using the legacy Lucene indexing within Nutch-1.2, therefore I am curious 
to 
understand if it is this command which is knocking off the solrindexer
3.Is it a possibility that there is a similar property such as solr.tmp.dir I 
need to set which I am missing and this is knocking solrindexer off?
4. Even after solrindexer kicks in, the solrdedup output does not appear to be 
responding correctly, this is shadowed by solrclean so I am definitely doing 
something wrong here, however I am unfamiliar with the IOException No 
FileSystem 
for scheme: http.

I understand that this post may seem a bit epic, but from the information I 
have 
E.g. logs, terminal output and user-lists I am stumped. I'm therefore looking 
for guys with more experience to possibly lend a hand. I can provide additional 
command parameters if this is of value.

Thanks in advance for any help Lewis





Re: will nutch-2 be able to index image files

2011-04-22 Thread alxsss
Hello,

Looks like I will have some spare time in the next month, so I may work on 
writing this image indexing plugin. I wondered if there is a similar plugin to 
leverage code from or follow it?

Thanks.
Alex.


-Original Message-
From: Andrzej Bialecki a...@getopt.org
To: user user@nutch.apache.org
Sent: Wed, Mar 9, 2011 12:24 am
Subject: Re: will nutch-2 be able to index image files


On 3/8/11 10:50 PM, alx...@aim.com wrote:
 I meant to extract image title, src link and alt fromimg tags and not store 
image files. For a keyword search in must display link, which automatically 
displays image itself in the search page.
 Not sure what do you mean image content-based retrieval? Do image files have 
tags like mp3 ones?

Yes, for example http://en.wikipedia.org/wiki/Exchangeable_image_file_format

 Must  a parse plugin be written in both cases?

Yes - most data is already available either in the DOM tree, or can be 
obtained from a Tika image parser, it just needs to be wrapped in a plugin.


-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 


Re: Hosts File Nutch 1.0+

2011-04-26 Thread alxsss
It seems you should move www.example.com example.com from line 3 to line 1,
uncomment line 3, and comment out the other lines.

Alex.


-Original Message-
From: Alex alex.thegr...@ambix.net
To: user user@nutch.apache.org
Sent: Tue, Apr 26, 2011 4:18 am
Subject: Re: Hosts File  Nutch 1.0+


Just in case someone has more ideas.  Here is how my hosts file look  
like:

http://pastebin.com/wyV7wnqn

Any help is highly appreciated!

Alex


On Apr 25, 2011, at 10:13 PM, Alex wrote:

 Dear Mark:

 Thank you so much for the help!

 I tried it but it still give me the same error.

 According to the developer is either a server environment for not  
 able to search itself or host file issue.


 Any other ideas?

 Thank you so much for your time!

 Alex



 On Apr 19, 2011, at 6:01 PM, Mark Achee wrote:

 With nslookup already showing the correct IP address, it doesn't  
 seem like a
 hostname/DNS issue.  But I assume this is what the developer is  
 talking
 about:

 At the end of your /etc/hosts file add

 127.0.0.1  www.example.org

 but replace www.example.org with your domain.  If you know what the  
 server's
 other IP address(es) is/are, you could try those also instead of  
 127.0.0.1.
 If that doesn't fix it, it's probably not really a hostname/DNS  
 issue.



 -Mark


 On Tue, Apr 19, 2011 at 6:47 PM, Alex alex.thegr...@ambix.net  
 wrote:

 I edited that so that it does not disclose the location of my
 rootUrLDir.  The path is accurate.

 I am going to find out what command is given to nutch but basically
 the application developer has confirmed that the issue is the hosts
 file or something on the server that can not search itself.

 Alex
 On Apr 19, 2011, at 5:22 PM, Mark Achee wrote:

 From your logs:

 INFO sitesearch.CrawlerUtil: rootUrlDir = /path/to/directory/


 Looks like you didn't set the seed urls directory.  If that's not
 enough
 info for you to fix it, send the full command you're running.

 -Mark



 On Thu, Apr 14, 2011 at 10:57 PM, Alex alex.thegr...@ambix.net
 wrote:

 Hi,

 I am new to Nutch.  I have an application that uses Nutch to  
 search.
 I have configured the application so that Nutch can run.  However,
 after a lot of troubleshooting I have been pointed to the fact  
 that
 there is something wrong with my hosts file.  My hostname is
 different
 than my domain name and that seems to make Nutch stop in depth  
 1.
 Does anyone have any idea of what is the correct configuration  
 of the
 hosts file so that nutch runs properly?

 My domain name resolves fine.  Please help me!

 Here are the logs of the indexing:

 Stopping at depth=1 - no more URLs to fetch.

 INFO sitesearch.CrawlerUtil: indexHost : Starting an Site Search
 index on host www.mydomain.com
 INFO sitesearch.CrawlerUtil: site search crawl started in: /opt/
 dotcms/
 dotCMS/assets/search_index/www.mydomain.com/1-XXX_temp/crawl-index
 ] INFO sitesearch.CrawlerUtil: rootUrlDir = /path/to/directory/
 search_index/www.mydomain.com/url_folder
 INFO sitesearch.CrawlerUtil: threads = 10
 INFO sitesearch.CrawlerUtil: depth = 20
 INFO sitesearch.CrawlerUtil: indexer=lucene

 INFO sitesearch.CrawlerUtil: Stopping at depth=1 - no more URLs to
 fetch.
 NFO sitesearch.CrawlerUtil: site search crawl finished: /
 directorypath/
 search_index/www.mydomain.com/1xxx/crawl-index
 INFO sitesearch.CrawlerUtil: indexHost : Finished Site Search  
 index
 on
 host www.mydomain.com





 


keeping index up to date

2011-06-01 Thread alxsss
Hello,

I use nutch-1.2 to index about 3000 sites. One of them has about 1500 PDF files
which do not change over time.
I wondered if there is a way of configuring nutch not to fetch unchanged
documents again and again, but to keep the old index entries for them.


Thanks.
Alex.


Re: keeping index up to date

2011-06-07 Thread alxsss

 

 Hi,

I took a look at the recrawl script and noticed that all the steps except url
injection are repeated for each subsequent crawl, and wondered why we would
generate new segments.
Is it possible to just do fetch and updatedb for all the previous segments
$s1..$sn, then invertlinks and index?

Thanks.
Alex.


 

 

-Original Message-
From: Julien Nioche lists.digitalpeb...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Jun 1, 2011 12:59 am
Subject: Re: keeping index up to date


You should use the adaptive fetch schedule. See
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
details
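
The relevant nutch-site.xml overrides look roughly like this (a sketch;
AdaptiveFetchSchedule ships with Nutch 1.x, and the interval values below are
only examples):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400</value>    <!-- 1 day, example value -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value>  <!-- 30 days, example value -->
</property>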

On 1 June 2011 07:18, alx...@aim.com wrote:

 Hello,

 I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf
 files which do not change over time.
 I wondered if there is a way of configuring nutch not to fetch unchanged
 documents again and again, but keep the old index for them.


 Thanks.
 Alex.




-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

 


ranking of search results

2011-07-22 Thread alxsss
Hello,

I use nutch 1.2 and solr to index about 3500 domains. I noticed that search
results for two or more keywords are not ranked properly.
For example, for the keyword Lady Gaga, some results that have only Lady are
displayed first, then some results with both keywords, etc. It seems to me that
results with both words should be displayed first, and those with only one of
the keywords should follow them.
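
(For reference, a sketch of the kind of dismax query that is supposed to help
here -- qf, pf and mm are standard Solr 1.4 dismax parameters; the field names
are from the Nutch schema and the boosts are only example values -- though
whether this is the right fix is exactly my question:)

http://localhost:8983/solr/select?q=Lady+Gaga
    &defType=dismax
    &qf=title^2.0+content
    &pf=title^3.0+content^1.5
    &mm=2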

Any idea how to correct this?

Thanks.
Alex.


Re: keeping index up to date

2011-07-26 Thread alxsss

 

 Hello,

One more question: is there a way of adding new urls to a crawldb created in
previous crawls, so that they are included in subsequent recrawls?

Thanks.
Alex. 

 

-Original Message-
From: lewis john mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org; markus.jelsma markus.jel...@openindex.io
Sent: Tue, Jun 7, 2011 1:16 pm
Subject: Re: keeping index up to date


Hi,

To add to Markus' comments, if you take a look at the script it is written
in such a way that if run in safe mode it protects us against an error which
may occur. If this is the case we an recover segments etc and take
appropriate actions to resolve.

On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma markus.jel...@openindex.iowrote:


   Hi,
 
  I took a look to the  recrawl script and noticed that all the steps
 except
  urls injection are repeated at the consequent indexing and wondered why
  would we generate new segments? Is it possible to do fetch, update for
 all
  previous $s1..$sn , invertlink  and index steps.

 No, the generater generates a segment with a list of URL for the fetcher to
 fetch. You can, if you like, then merge segments.

 
  Thanks.
  Alex.
 
 
 
 
 
 
  -Original Message-
  From: Julien Nioche lists.digitalpeb...@gmail.com
  To: user user@nutch.apache.org
  Sent: Wed, Jun 1, 2011 12:59 am
  Subject: Re: keeping index up to date
 
 
  You should use the adaptative fetch schedule. See
  http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
  http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/%20
 for
  details
 
  On 1 June 2011 07:18, alx...@aim.com wrote:
   Hello,
  
   I use nutch-1.2 to index about 3000 sites. One of them has about 1500
 pdf
   files which do not change over time.
   I wondered if there is a way of configuring nutch not to fetch
 unchanged
   documents again and again, but keep the old index for them.
  
  
   Thanks.
   Alex.




-- 
*Lewis*

 


Re: solrindex command` not working

2011-07-26 Thread alxsss
check for errors in solr log.
 

-Original Message-
From: Way Cool way1.wayc...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jul 26, 2011 3:14 pm
Subject: Re: solrindex command` not working


The latest solr version is 3.3. Maybe you can try that.


On Tue, Jul 26, 2011 at 2:10 AM, Marseld Dedgjonaj 
marseld.dedgjo...@ikubinfo.com wrote:

 Hello list,

 I am having trouble using nutch + solr in index step.

 I use nutch 1.2 and solr 1.3.

 When I execute command:

 $NUTCH_HOME/bin/nutch solrindex http://127.0.0.1:8983/solr/main/
 $crawldir/crawldb $crawldir/linkdb $crawldir/segments/*

 I got 2011-07-25 20:11:59,702 ERROR solr.SolrIndexer -
 java.io.IOException:
 Job failed! and no index passed to the SOLR.



 Any Idea what I am doing wrong.



 Thanks in advance,

 Marseld





 p class=MsoNormalspan style=color: rgb(31, 73, 125);Gjeni
 bPuneuml; teuml; Mireuml;/b dhe bteuml; Mireuml; peuml;r
 Puneuml;/b... Vizitoni: a target=_blank href=http://www.punaime.al/
 www.punaime.al/a/span/p
 pa target=_blank href=http://www.punaime.al/;span
 style=text-decoration: none;img width=165 height=31 border=0
 alt=punaime src=http://www.ikub.al/images/punaime.al_small.png;
 //span/a/p


 


ranking in nutch/solr results

2011-07-30 Thread alxsss
Hello,

I use nutch-1.2 with solr 1.4. Recently, I noticed that for a search for a domain
name, for example yahoo.com, yahoo.com is not in the first place. Instead, other
sites that have yahoo.com in their content are in the first places. I tested this
with google. In its results the domain is in the first place.

Any idea how to fix this in Nutch/Solr results?

Thanks in advance.
Alex.


Re: nutch redirect treatment

2011-08-17 Thread alxsss
https://issues.apache.org/jira/browse/NUTCH-1044

-Original Message-
From: abhayd ajdabhol...@hotmail.com
To: nutch-user nutch-u...@lucene.apache.org
Sent: Wed, Aug 17, 2011 11:44 am
Subject: nutch redirect treatment


hi 
I have seen similar posts in this forum but still not able to understand how
redirect is handled..

I m trying to crawl http://developer.att.com/developer/ . After successful
crawl i dump the crawldb using readdb. I see entries like following.  What
does this mean? Has nutch crawled the redirected page and is it in index?

 I tried using readseg command  with all the segments under crawl/segments
directory but i could not find 
http://developer.att.com/developer/tier1page.jsp?passedItemId=16_requestid=35037
url.

heres is my crawl/segments directory listing.
20110817001833  20110817002117  20110817003028  20110817003930 
20110817004202
20110817001844  20110817002556  20110817003532  20110817004105

Any help why redirected page is not crawled?

http://developer.att.com/developer/ Version: 7
Status: 4 (db_redir_temp)
Fetch time: Fri Sep 16 00:18:36 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
http://developer.att.com/developer/tier1page.jsp?passedItemId=16_requestid=35037

http://developer.att.com/developer/16   Version: 7
Status: 5 (db_redir_perm)
Fetch time: Fri Sep 16 00:43:33 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata: _pst_: moved(12), lastModified=0:
http://developer.att.com/developer/forward.jsp?passedItemId=16



--
View this message in context: 
http://lucene.472066.n3.nabble.com/nutch-redirect-treatment-tp3261546p3261546.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


Re: nutch redirect treatment

2011-08-17 Thread alxsss
As far as I understand, redirected urls are scored 0 and that is why the fetcher
does not pick them up at the earlier depths. They may be crawled starting at
depth 4, depending on the size of the seed list.

 

 

-Original Message-
From: abhayd ajdabhol...@hotmail.com
To: nutch-user nutch-u...@lucene.apache.org
Sent: Wed, Aug 17, 2011 4:41 pm
Subject: Re: nutch redirect treatment


thanks for response.

But my issue is after redirect new url is not being crawled. Not a scoring
issue.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/nutch-redirect-treatment-tp3261546p3263311.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


Re: fetcher runs without error with no internet connection

2011-08-23 Thread alxsss
Hi Lewis,

I stopped the fetcher and started it on the same segment again.
But before doing that I turned off the modem, and the fetcher started giving
UnknownHostException.
It was not giving any error during the DSL failure, i.e. when I was not able to
connect to any sites. Again, this is nutch-1.2.

Thanks.
Alex.

 

 

-Original Message-
From: lewis john mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Aug 23, 2011 6:37 am
Subject: Re: fetcher runs without error with no internet connection


Hi Alex,

Did you get anywhere with this?

What condition led to you seeing unknown host exception?

Unless segment gets corrupted, I would assume you could fetch again.
Hopefully you can confirm this.

On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote:

 Hello,

 After running bin/nutch fetch $segment for 2 days, internet connection was
 lost, but nutch did not give any errors. Usually I was seeing Unknown host
 exception before.
 Any ideas what happened and is it OK to stop the fetch and run it again on
 the same (old) segment? This is nutch -1.2

 Thanks.
 Alex.




-- 
*Lewis*

 


Re: fetcher runs without error with no internet connection

2011-08-30 Thread alxsss
It is a DNS problem, because it was giving a lot of UnknownHostException errors.
I decreased the thread number to 5, but the DSL still fails periodically.
I wondered what a typical internet connection is for fetching about 3500
domains. I currently have DSL at 3 Mbps.
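
For reference, the fetcher settings in question, as overridden in nutch-site.xml
(fetcher.threads.fetch is what I decreased to 5; the delay property and its value
are only an example):

<property>
  <name>fetcher.threads.fetch</name>
  <value>5</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>  <!-- example value; a longer delay eases the DNS and bandwidth load -->
</property>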

Thanks.
Alex.

 

-Original Message-
From: Markus Jelsma markus.jel...@openindex.io
To: user user@nutch.apache.org
Sent: Mon, Aug 29, 2011 5:19 pm
Subject: Re: fetcher runs without error with no internet connection


I didn't say you have a DNS-problem only that these exception may occur if the 
DNS can't keep up with the requests you make. Make sure you have a DNS problem 
before trying to solve a problem that doesn't exist. It's normal to have these 
exceptions once in a while.

Solving DNS issues are beyond the scope of this list. You may, however, opt 
for some DNS caching in your network.

 What is the solution to the issue with DNS server?
 
 
 
 
 
 -Original Message-
 From: Markus Jelsma markus.jel...@openindex.io
 To: user user@nutch.apache.org
 Sent: Tue, Aug 23, 2011 12:32 pm
 Subject: Re: fetcher runs without error with no internet connection
 
 
 If you fetch too hard, your DNS-server may not be able to keep up.
 
  Hi Lewis,
  
  I stopped fetcher and started it on the same segment again.
  But before doing that I turned off modem and fetcher started giving
  Unknown.Host exception. It was not giving any error, with dsl failure,
  i.e. I was not able to connect to any sites. Again this is nutch-1.2.
  
  Thanks.
  Alex.
  
  
  
  
  
  -Original Message-
  From: lewis john mcgibbney lewis.mcgibb...@gmail.com
  To: user user@nutch.apache.org
  Sent: Tue, Aug 23, 2011 6:37 am
  Subject: Re: fetcher runs without error with no internet connection
  
  
  Hi Alex,
  
  Did you get anywhere with this?
  
  What condition led to you seeing unknown host exception?
  
  Unless segment gets corrupted, I would assume you could fetch again.
  Hopefully you can confirm this.
  
  On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote:
   Hello,
   
   After running bin/nutch fetch $segment for 2 days, internet connection
   was lost, but nutch did not give any errors. Usually I was seeing
   Unknown host exception before.
   Any ideas what happened and is it OK to stop the fetch and run it again
   on the same (old) segment? This is nutch -1.2
   
   Thanks.
   Alex.

 


spellchecking in nutch solr

2011-09-01 Thread alxsss
Hello,

I have tried to implement an index-based spellchecker in nutch-solr by adding a 
spell field to schema.xml and making it a copy of the content field. However, 
this doubled the data folder size, and the spell field, as a copy of the content 
field, appears in the xml feed, which is not necessary. Is it possible to implement 
the spellchecker without this issue?

Thanks.
Alex.
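
For illustration, a minimal schema.xml sketch of the copy field I mean (the field and type names are placeholders and may differ from the schema shipped with your Nutch/Solr version); marking the spell field as not stored should keep it out of the xml feed and avoid doubling the stored data, although the index itself will still grow:

<field name="spell" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="spell"/>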


Re: Crawl fails - Input path does not exist

2011-09-13 Thread alxsss
Comparing with nutch-1.2, I do not see any content folder under the segments.
Does this mean that we cannot set store.content to false in nutch-1.3? 

Thanks.
Alex.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3334709.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: more from link

2011-09-14 Thread alxsss
I see what is done in nutch results. Results are grouped with 1 doc in each 
group; I need to group with at most 3 docs in each group.
In Solr, it is impossible to paginate when grouping with more than 1 doc in 
each group.

Google can do it with 5 docs in the first group, as far as I can see.

Thanks.
Alex.
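
For comparison, a sketch of the kind of Solr field-collapsing request suggested below (the site field name is an assumption about the schema; with grouping enabled, rows counts groups rather than documents, which is where my pagination trouble starts):

http://localhost:8983/solr/select?q=foo&group=true&group.field=site&group.limit=3&start=0&rows=10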

 

 

-Original Message-
From: Markus Jelsma markus.jel...@openindex.io
To: user user@nutch.apache.org
Sent: Wed, Sep 14, 2011 2:24 am
Subject: Re: more from  link


Field collapse on site or host field.

  Hello,
 
 In nutch search page there is more from link in case when there are many
 results from the same site. Is there a way to have this kind of link in
 case when Solr is used as front end.?
 
 Thanks.
 Alex.

 


restart a failed job

2011-09-20 Thread alxsss

 

 Hello,

I wondered if it is possible to restart a failed job in nutch-1.3 version.
I have this error 

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/



after fetching for 5 days. I know the reason for the error, but I do not want to 
restart the whole process from the beginning. I use nutch in local mode on one 
machine.

Thanks.
Alex.




fetch command does not parse

2011-09-22 Thread alxsss

 

 Hello,

I tried fetch command with the following config

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>

However, the fetcher did not parse: there are no parse folders under the segment, and 
updatedb gives errors.
I wonder how to crawl without storing content in version 1.3?

Thanks.
Alex.
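
As a fallback sketch, running the parse as a separate step would look like this (the segment name is a placeholder):

bin/nutch fetch crawl/segments/20110922000000
bin/nutch parse crawl/segments/20110922000000
bin/nutch updatedb crawl/crawldb crawl/segments/20110922000000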





Re: Removing urls from crawl db

2011-11-01 Thread alxsss
I think you must add a regex to regex-urlfilter.txt. In that case those urls 
will not be fetched by the fetcher.
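
A minimal regex-urlfilter.txt sketch of what I mean (the host here is a made-up example standing in for the accidentally injected urls):

# skip everything from the accidentally injected host
-^http://badhost\.example\.com/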
 

-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Nov 1, 2011 10:35 am
Subject: Re: Removing urls from crawl db


Already did that.  But it doesn't allow me to delete urls from the list to
be crawled.

On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema ferdy.gal...@kalooga.comwrote:

 As for reading the crawldb, you can use 
 org.apache.nutch.crawl.CrawlDbReader.
 This allows for dumping the crawldb into a readable textfile as well as
 querying individual urls. Run without args to see its usage.


 On 10/31/2011 08:47 PM, Markus Jelsma wrote:

 Hi

 Write an regex URL filter and use it the next time you update the db; it
 will
 disappear. Be sure to backup the db first in case your regex catches valid
 URL's. Nutch 1.5 will have an option to keep the previous version of the
 DB
 after update.

 cheers

  We accidentally injected some urls into the crawl database and I need to
 go
 remove them.  From what I understand, in 1.4 I can view and modify the
 urls
 and indexes.  But I can't seem to find any information on how to do this.

 Is there anything regarding this available?



 


Re: how use NUTCH-16 in my nutch 1.3?

2011-11-03 Thread alxsss
I think this patch is already included in the current version. 
 

-Original Message-
From: mina tahereganji...@gmail.com
To: nutch-user nutch-u...@lucene.apache.org
Sent: Wed, Nov 2, 2011 7:08 pm
Subject: how use NUTCH-16 in my nutch 1.3?


i want to use NUTCH-61 in  http://issues.apache.org/jira/browse/NUTCH-61
https://issues.apache.org/jira/browse/NUTCH-61

but i don't know how use that and use it in my nutch 1.3? help me.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-use-NUTCH-16-in-my-nutch-1-3-tp3473096p3473096.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


Re: Fetching just some urls outside domain

2011-12-01 Thread alxsss
Hello,

It would be interesting to know how one can put a filter on outlinks. I mean, if I 
have a regex, in which file should I put it?
For example, I want nutch to ignore outlinks ending with .info.

Thanks.
Alex.
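
For the .info case, a rough sketch of the rule I have in mind for conf/regex-urlfilter.txt (the pattern is untested and assumes the .info part is the top-level domain of the host):

# skip urls whose host ends with .info
-^https?://([a-zA-Z0-9-]+\.)*[a-zA-Z0-9-]+\.info(/|$)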

 

 

 

-Original Message-
From: Arkadi.Kosmynin arkadi.kosmy...@csiro.au
To: user user@nutch.apache.org
Sent: Thu, Dec 1, 2011 1:44 pm
Subject: RE: Fetching just some urls outside domain


Hi Adriana,

You can try Arch for this:

http://www.atnf.csiro.au/computing/software/arch

You can configure it to crawl your web sites plus sets of miscellaneous URLs 
called bookmarks in Arch. Arch is a free extension of Nutch. Right now, only 
Arch based on Nutch 1.2 is available for downloading. We are about to release 
Arch based on Nutch 1.4.

Regards,

Arkadi



 -Original Message-
 From: Adriana Farina [mailto:adriana.farin...@gmail.com]
 Sent: Thursday, 1 December 2011 7:58 PM
 To: user@nutch.apache.org
 Subject: Re: Fetching just some urls outside domain
 
 Hi!
 
 Thank you for your answer. You're right, maybe an example would explain
 better what I need to do.
 
 I have to perform the following task. I have to explore a specific
 domain (.
 gov.it) and I have an initial set of seeds, for example www.aaa.it,
 www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch
 pages outside that domain. However some resources I need to download
 (documents) are stored on web sites that are not inside the domain I'm
 interested in.
 For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
 (where
 www.somesite.it is not inside my domain). Nutch will not fetch that
 page
 since I told it to behave that way, but I need to download documents
 stored
 on www.somesite.it. So I need nutch to go outside the domain I
 specified
 only when it sees the words albi or albo inside the url, since that
 words identify the documents I need. How can I do this?
 
 I hope I've been clear. :)
 
 
 
 2011/11/30 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 
  Hi Adriana,
 
  This should be achievable through fine grained URL filters. It is
 kindof
  hard to substantiate on this without you providing some examples of
 the
  type of stuff you're trying to do!
 
  Lewis
 
  On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina 
  adriana.farin...@gmail.com
   wrote:
 
   Hello,
  
   I'm using nutch 1.3 from just a month, so I'm not an expert. I
 configured
   it so that it doesn't fetch pages outside a specific domain.
 However now
  I
   need to let it fetch pages outside the domain I choosed but only
 for some
   urls (not for all the urls I have to crawl). How can I do this? I
 have to
   write a new plugin?
  
   Thanks.
  
 
 
 
  --
  *Lewis*
 

 


Re: Fetching just some urls outside domain

2011-12-01 Thread alxsss
If I understand you correctly, you are saying that even if my question is related to 
the current thread, I must nevertheless open a new one?
 

 

 

-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Dec 1, 2011 3:01 pm
Subject: Re: Fetching just some urls outside domain


Nutch comes packed with quite a few url-filters out of the box. They just
need some tuning.

Have a look in NUTCH_HOME/conf

Also have a look at the corresponding plugins. Realistically you should
really start a new thread for new questions :0)

I think you're looking for the urlfilter-domain plugin

On Thu, Dec 1, 2011 at 10:48 PM, alx...@aim.com wrote:

 Hello,

 It is interesting to know how can one put a filter on outlinks? I mean if
 I have a regex, in which file should I put it?
 For example, I want nutch to ignore outlinks ending with .info.

 Thanks.
 Alex.







 -Original Message-
 From: Arkadi.Kosmynin arkadi.kosmy...@csiro.au
 To: user user@nutch.apache.org
 Sent: Thu, Dec 1, 2011 1:44 pm
 Subject: RE: Fetching just some urls outside domain


 Hi Adriana,

 You can try Arch for this:

 http://www.atnf.csiro.au/computing/software/arch

 You can configure it to crawl your web sites plus sets of miscellaneous
 URLs
 called bookmarks in Arch. Arch is a free extension of Nutch. Right now,
 only
 Arch based on Nutch 1.2 is available for downloading. We are about to
 release
 Arch based on Nutch 1.4.

 Regards,

 Arkadi



  -Original Message-
  From: Adriana Farina [mailto:adriana.farin...@gmail.com]
  Sent: Thursday, 1 December 2011 7:58 PM
  To: user@nutch.apache.org
  Subject: Re: Fetching just some urls outside domain
 
  Hi!
 
  Thank you for your answer. You're right, maybe an example would explain
  better what I need to do.
 
  I have to perform the following task. I have to explore a specific
  domain (.
  gov.it) and I have an initial set of seeds, for example www.aaa.it,
  www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch
  pages outside that domain. However some resources I need to download
  (documents) are stored on web sites that are not inside the domain I'm
  interested in.
  For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
  (where
  www.somesite.it is not inside my domain). Nutch will not fetch that
  page
  since I told it to behave that way, but I need to download documents
  stored
  on www.somesite.it. So I need nutch to go outside the domain I
  specified
  only when it sees the words albi or albo inside the url, since that
  words identify the documents I need. How can I do this?
 
  I hope I've been clear. :)
 
 
 
  2011/11/30 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 
   Hi Adriana,
  
   This should be achievable through fine grained URL filters. It is
  kindof
   hard to substantiate on this without you providing some examples of
  the
   type of stuff you're trying to do!
  
   Lewis
  
   On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina 
   adriana.farin...@gmail.com
wrote:
  
Hello,
   
I'm using nutch 1.3 from just a month, so I'm not an expert. I
  configured
it so that it doesn't fetch pages outside a specific domain.
  However now
   I
need to let it fetch pages outside the domain I choosed but only
  for some
urls (not for all the urls I have to crawl). How can I do this? I
  have to
write a new plugin?
   
Thanks.
   
  
  
  
   --
   *Lewis*
  





-- 
*Lewis*

 


Re: how give several sites to nutch to crawl?

2011-12-03 Thread alxsss
I think you should add this to nutch-site.xml
<property>
  <name>generate.max.count</name>
  <value>1000</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

 
and set topN to -1

Alex.

 

 

-Original Message-
From: mina tahereganji...@gmail.com
To: nutch-user nutch-u...@lucene.apache.org
Sent: Sat, Dec 3, 2011 6:10 pm
Subject: Re: how give several sites to nutch to crawl?


thanks for your answer. i use this script to crawl my sites:



$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/seedUrls

for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/segments $topN

  if [ $? -ne 0 ]
  then
    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment1
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
    echo "deepcrawler: Deleting segment $segment1."
    rm $RMARGS $segment1
    continue
  fi
  $NUTCH_HOME/bin/nutch parse $segment1
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done

echo "- Merge Segments (Step 5 of $steps) -"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments $NUTCH_HOME/bin/crawl1/segments/*

if [ $safe != yes ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
  mv $MVARGS $NUTCH_HOME/bin/crawl1/segments $NUTCH_HOME/bin/crawl1/BACKUPsegments
fi

mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments $NUTCH_HOME/bin/crawl1/segments

echo "- Invert Links (Step 6 of $steps) -"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb $NUTCH_HOME/bin/crawl1/segments/*

if [ $safe != yes ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
  mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes $NUTCH_HOME/bin/crawl1/BACKUPindexes
  mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi

$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb $NUTCH_HOME/bin/crawl1/segments/*


but nutch doesn't crawl all pages on every site; for example, when topN=1000,
nutch crawls 700 pages from site1, 250 from site2, 40 from site3 and 10
pages from site4. I want nutch to crawl 1000 pages from each site. Help me.



--

View this message in context: 
http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3558152.html

Sent from the Nutch - User mailing list archive at Nabble.com.


 


Re: Can't crawl a domain; can't figure out why.

2011-12-20 Thread alxsss
It seems that robots.txt on libraries.mit.edu has a lot of restrictions.

Alex.

 

-Original Message-
From: Chip Calhoun ccalh...@aip.org
To: user user@nutch.apache.org; 'markus.jel...@openindex.io' 
markus.jel...@openindex.io
Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.


I just compared this against a similar crawl of a completely different domain 
which I know works, and you're right on both counts. The parser doesn't parse a 
file, and nothing is sent to the solrindexer. I tried a crawl with more 
documents and found that while I can get documents from mit.edu, I get 
absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 
1.3 
as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a 
parse error, but how could I tell? I've spoken with some helpful people from 
MIT, and they don't see a reason why this wouldn't work.

Chip

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the 
domain you can't crawl. libraries.mit.edu seems to work, although the indexer 
doesn't seem to send a document in and the parser doesn't mention parsing that 
file.

Either the file throws a parse error or is truncated or 

 I'm trying to crawl pages from a number of domains, and one of these 
 domains has been giving me trouble. The really irritating thing is 
 that it did work at least once, which led me to believe that I'd 
 solved the problem. I can't think of anything at this point but to 
 paste my log of a failed crawl and solrindex and hope that someone can 
 think of anything I've overlooked. Does anything look strange here?
 
 Thanks,
 Chip
 
 2011-12-19 16:31:01,010 WARN  crawl.Crawl - solrUrl is not set, 
 indexing will be skipped... 2011-12-19 16:31:01,404 INFO  crawl.Crawl 
 - crawl started in: mit-c-crawl 2011-12-19 16:31:01,420 INFO  
 crawl.Crawl - rootUrlDir = mit-c-urls 2011-12-19 16:31:01,420 INFO  
 crawl.Crawl - threads = 10
 2011-12-19 16:31:01,420 INFO  crawl.Crawl - depth = 1
 2011-12-19 16:31:01,420 INFO  crawl.Crawl - solrUrl=null
 2011-12-19 16:31:01,420 INFO  crawl.Crawl - topN = 50
 2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: starting at
 2011-12-19 16:31:01 2011-12-19 16:31:01,420 INFO  crawl.Injector -
 Injector: crawlDb: mit-c-crawl/crawldb 2011-12-19 16:31:01,420 INFO 
 crawl.Injector - Injector: urlDir: mit-c-urls 2011-12-19 16:31:01,436 
 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2011-12-19 16:31:02,854 INFO  plugin.PluginRepository - Plugins: 
 looking
 in: C:\Apache\apache-nutch-1.4\runtime\local\plugins 2011-12-19
 16:31:02,917 INFO  plugin.PluginRepository - Plugin Auto-activation mode:
 [true] 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Registered
 Plugins: 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -  
  the nutch core extension points (nutch-extensionpoints) 2011-12-19
 16:31:02,917 INFO  plugin.PluginRepository -Basic URL
 Normalizer (urlnormalizer-basic) 2011-12-19 16:31:02,917 INFO 
 plugin.PluginRepository -Html Parse Plug-in (parse-html)
 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   
 Basic Indexing Filter (index-basic) 2011-12-19 16:31:02,917 INFO 
 plugin.PluginRepository -Http / Https Protocol Plug-in
 (protocol-httpclient) 2011-12-19 16:31:02,917 INFO 
 plugin.PluginRepository -HTTP Framework (lib-http)
 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   
 Regex URL Filter (urlfilter-regex) 2011-12-19 16:31:02,917 INFO 
 plugin.PluginRepository -Pass-through URL Normalizer
 (urlnormalizer-pass) 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository
 -Http Protocol Plug-in (protocol-http) 2011-12-19
 16:31:02,917 INFO  plugin.PluginRepository -Regex URL
 Normalizer (urlnormalizer-regex) 2011-12-19 16:31:02,917 INFO 
 plugin.PluginRepository -Tika Parser Plug-in (parse-tika)
 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   
 OPIC Scoring Plug-in (scoring-opic) 2011-12-19 16:31:02,917 INFO 
 plugin.PluginRepository -CyberNeko HTML Parser
 (lib-nekohtml) 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -
Anchor Indexing Filter (index-anchor) 2011-12-19 16:31:02,917
 INFO  plugin.PluginRepository -URL Meta Indexing Filter
 (urlmeta) 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - 
   Regex URL Filter Framework (lib-regex-filter) 2011-12-19
 16:31:02,917 INFO  plugin.PluginRepository - Registered Extension-Points:
 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   
 Nutch 

Re: Solrdedup fails due to date format

2012-02-01 Thread alxsss
Hello,

I took a look at the source of the SolrDeleteDuplicates class. The patch is already 
applied.
Any ideas what might be wrong? I issue this command

bin/nutch solrdedup http://127.0.0.1:8983/solr/

and the solr schema is the one that comes with nutch.
  
Thanks in advance.
Alex.
 

 

-Original Message-
From: Alexander Aristov alexander.aris...@gmail.com
To: user user@nutch.apache.org
Cc: nutch-user nutch-u...@lucene.apache.org
Sent: Tue, Jan 31, 2012 9:34 pm
Subject: Re: Solrdedup fails due to date format


what is your solr schema configuration for nutch fields?



Best Regards

Alexander Aristov





On 1 February 2012 09:26, alx...@aim.com wrote:



 Hello,



 I have tried solrdedup in nutch-1.3 and 1,4. Both give

 WARNING: Error reading a field from document :

 SolrDocument[{boost=5.38071E-4, digest=79e4d5033ef83223b17c56b7c7d853b3}]

 java.lang.NumberFormatException: For input string:



 There is patch at https://issues.apache.org/jira/browse/NUTCH-986 and it

 is stated that is a fix for 1.3



 Any comment on this.



 Thanks.

 Alex.




 


Re: http.redirect.max

2012-03-01 Thread alxsss

 Hello,

I tried 1, 2, and -1 for the http.redirect.max config, but nutch still postpones 
redirected urls to later depths.
What is the correct setting to have nutch crawl redirected urls 
immediately? I need it because I have a restriction that the depth be at most 2.

Thanks.
Alex.
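
For reference, this is the nutch-site.xml setting I am experimenting with, as a sketch (the value shown is just one of the values I tried; my understanding is that 0 or a negative value makes the fetcher record redirects for a later fetch instead of following them):

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page.</description>
</property>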

 

 

-Original Message-
From: xuyuanme xuyua...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max


The config file is used for some proof of concept testing so the content
might be confusing, please ignore some incorrect part.

Yes from my end I can see the crawl for website http://www.scotland.gov.uk
is redirected as expected.

However the website I tried to crawl is a bit more tricky.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B
as the seed page

2. And try to crawl one of the link
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.OverviewDrugName=BACIGUENT)
as a test

If you click the link, you'll find the website use redirect and cookie to
control page navigation. So I used protocol-httpclient plugin instead of
protocol-http to handle the cookie.

However, the redirect does not happen as expected. The only way I can fetch
second link is to manually change response = getResponse(u, datum,
*false*) call to response = getResponse(u, datum, *true*) in
org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B


lewis john mcgibbney wrote
 
 I've checked working with redirects and everything seems to work fine for
 me.
 
 The site I checked on
 
 http://www.scotland.gov.uk
 
 temp redirect to
 
 http://home.scotland.gov.uk/home
 
 Nutch gets this fine when I do some tweaking with nutch-site.xml
 
 redirects property -1 (just to demonstrate, I would usually not set it so)
 
 Lewis
 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


different fetch interval for each depth urls

2012-03-01 Thread alxsss
Hello,

I need to have different fetch intervals for the initial seed urls and the urls 
extracted from them at depth 1. How can this be achieved? I tried the -adddays 
option of the generate command, but it seems it cannot be used to solve this issue. 
Thanks in advance.
Alex.
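
One option I am considering, assuming the Injector in this version supports per-url metadata in the seed list (I have not verified this), is to give the seed urls their own interval at inject time; a sketch with a placeholder url and a tab-separated key=value pair:

http://www.example.com/	nutch.fetchInterval=86400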


Re: different fetch interval for each depth urls

2012-03-02 Thread alxsss

I need to make this a cron job, so I cannot make changes manually. 
My problem is to index newspaper sites, but only the new links that are added 
every day, without fetching ones that have already been fetched.

Thanks.
Alex.

 

 

-Original Message-
From: Markus Jelsma markus.jel...@openindex.io
To: user user@nutch.apache.org
Cc: nutch-user nutch-u...@lucene.apache.org
Sent: Thu, Mar 1, 2012 10:30 pm
Subject: Re: different fetch interval for each depth urls


 Well, you could set a new default fetch interval in your configuration 
 after the first crawl cycle but the depth information is lost if you 
 continue crawling so there is no real solution.

 What problem are you trying to solve anyway?

 On Fri, 2 Mar 2012 00:19:34 -0500 (EST), alx...@aim.com wrote:
 Hello,

 I need to have different fetch intervals for initial seed urls and
 urls extracted from them at depth 1. How this can be achieved. I 
 tried
 -adddays option in generate command but it seems it cannot be used to
 solve this issue.

 Thanks in advance.
 Alex.

-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350

 


using less resources

2012-05-22 Thread alxsss
Hello,

As far as I understand, nutch recrawls urls when their fetch time has passed the 
current time, regardless of whether those urls were modified or not.
Is there any initiative on restricting recrawls to only those urls whose 
modified time (MT) is greater than the old MT?
In detail: if nutch has crawled a url with the next fetch time in 30 days, then 
in the second recrawl nutch should visit this url, retrieve its modified time, 
compare it with the modified time that we have in the crawldb, and recrawl it if 
the new MT is greater than the old one, otherwise skip it.

Thanks.
Alex.
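
One related knob I am aware of is the fetch schedule class, which can be switched in nutch-site.xml; a sketch (I have not confirmed how well it tracks modification in practice):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>Adjusts the fetch interval per url depending on whether the
  page appears to have changed between fetches.</description>
</property>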




nutch-2.0 updatedb and parse commands

2012-06-18 Thread alxsss
Hello,

It seems to me that all the options to the updatedb command that nutch 1.4 has have 
been removed in nutch-2.0. I would like to know whether this was done purposefully 
or whether they will be added later. Also, how can I create multiple documents using the parse 
command? It seems there are not sufficient arguments to the parse command either.

Thanks in advance.
Alex. 


Re: nutch-2.0 updatedb and parse commands

2012-06-19 Thread alxsss
Hi Lewis,

In the 1.X version there is a -noAdditions option to the updatedb command and an -adddays option 
to the generate command. How can something similar be done in the 2.X version?

Here, at http://wiki.apache.org/nutch/Nutch2Roadmap, it is stated:
"Modify code so that parser can generate multiple documents which is what 1.x 
does but not 2.0"
It is my understanding that 1.X's parser does not create multiple documents, 
though. Then what is the meaning of the above statement?

Thanks.
Alex.

 

 

 

-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jun 19, 2012 6:09 am
Subject: Re: nutch-2.0 updatedb and parse commands


Hi Alex,

On Mon, Jun 18, 2012 at 8:11 PM,  alx...@aim.com wrote:
 Hello,

 It seems to me that all options to updatedb command that nutch 1.4 has, have 
been removed in nutch-2.0. I would like to know if this was done purposefully 
or 
they will be added later?

As you have noticed there are a number of differences between 1.X and 2.X.

W.r.t the ones you highlight e.g. CLI, yes these are intended

 Also, how can I create multiple doc using parse command?

Can you please elaborate slightly?

 It seem there is no sufficient arguments to parse command too.

What would you like to see added? If you feel like adding
functionality then please open a ticket and if possible submit a patch
if you have time. I've been working with the parsing code and it works
fine for me but I don't fully understand your comment so if you again
could elaborate if would be excellent.

Thanks

Lewis


 Thanks in advance.
 Alex.



-- 
Lewis

 


Re: using less resources

2012-06-20 Thread alxsss
I was thinking of using the Last-Modified header, but it may be absent. In that
case we could use the signature of the urls at indexing time. I took a look at the
code; it seems it is implemented but not working. I tested nutch-1.4 with
a single url: the solrindexer always sends the same number of documents to solr,
although none of the urls has changed.

Thanks.
Alex.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/using-less-resources-tp3985537p3990625.html
Sent from the Nutch - User mailing list archive at Nabble.com.


parse and solrindex in nutch-2.0

2012-06-25 Thread alxsss
Hello,

I have tested nutch-2.0 with hbase and mysql trying to index only one url with 
depth 1.

 I tried to fetch an html tag value and parse it into the metadata column of the webpage 
object by adding a parse-tag plugin. I saw there is no metadata member variable 
in the Parse class, so I used the putToMetadata function of the WebPage class, and it 
turned out that this function overwrites values for the same key, i.e., it 
keeps only the last tag value if there are multiple tags.
 
Next 

bin/nutch solrindex http://127.0.0.1:8983/solr/ -all
SolrIndexerJob: starting
SolrIndexerJob: done.

I did 
1. bin/nutch inject
2. bin/nutch generate
3. bin/nutch fetch batchId
4. bin/nutch parse batchId
5. bin/nutch solrindex http://127.0.0.1:8983/solr/ -all

There is no data added to solr index with the url I tried to index.

Besides these, nutch-2.0 keeps content in the content column of webpage table 
if I put in the config 

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>


Any ideas, what is done wrong or how to fix these issues are welcome.

Thanks.
Alex.






Re: parse and solrindex in nutch-2.0

2012-07-02 Thread alxsss
Hi,

Thank you for the clarifications. 
Regarding the metadata, what would be a proper way of parsing and indexing 
multivalued tags in nutch-2.0 then?

Thanks.
Alex.



-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Wed, Jun 27, 2012 1:20 am
Subject: Re: parse and solrindex in nutch-2.0


Hi,

Correct. When using specific_batchid or -all you have to run the
updaterjob first. (Because it checks the dbupdate mark to not be null). But
a workaround is to simply run the indexer with -reindex. This will ignore
the db update mark and tries to index every parsed row (at any time).

About the metadata: It's a known limitation that there cannot be any
duplicate keys. (I'm not aware of any progress regarding this).

fetcher.store.content indeed does not seem to work. This is a bug. I
created an issue for this: NUTCH-1411

Ferdy.

On Tue, Jun 26, 2012 at 11:47 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 update (or whatever the actual name of the command is) after parsing?

 On 25 June 2012 22:35, alx...@aim.com wrote:

  Hello,
 
  I have tested nutch-2.0 with hbase and mysql trying to index only one url
  with depth 1.
 
   I tried to fetch an html tag value and parse it to metadata column in
  webpage object by adding parse-tag plugin. I saw there is no metadata
  member variable in Parse class, so I used putToMetadata function from
  Webpage class and it turned  out that this function overwrites values for
  the same key, i.e, it keeps only the last tag value if there are multiple
  tags.
 
  Next
 
  bin/nutch solrindex http://127.0.0.1:8983/solr/ -all
  SolrIndexerJob: starting
  SolrIndexerJob: done.
 
  I did
  1.bin/nutch inject
  2.bin/nutch generate
  3.bin/nutch fetch batchId
  4.bin/nutch parse batchId
  5.bin/nutch bin/nutch solrindex http://127.0.0.1:8983/solr/ -all
 
  There is no data added to solr index with the url I tried to index.
 
  Besides these, nutch-2.0 keeps content in the content column of webpage
  table if I put in the config
 
   property
 namefetcher.store.content/name
   valuefalse/value
   descriptionIf true, fetcher will store content./description
   /property
 
 
  Any ideas, what is done wrong or how to fix these issues are welcome.
 
  Thanks.
  Alex.
 
 
 
 
 


 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble


 


Re: parse and solrindex in nutch-2.0

2012-07-03 Thread alxsss
Hi,

I was planning to parse img tags from a url's content and put them in the metadata 
field of the WebPage storage class in nutch-2.0, to retrieve them later in the 
indexing step.
However, since there is no metadata-type variable in the Parse class (compare 
with outlinks), this cannot be done in nutch 2.0 (compare the Parse class with the 
metadata-type variable in nutch 1.X). One is restricted to using the putToMetadata 
function of the WebPage class, which overwrites values, i.e., if I try to put two 
metadata entries img_alt:alt1 and img_alt:alt2, I get only the last value, img_alt:alt2, in 
the metadata field.

So, my question is: how can img tag alt values be indexed in nutch-2.0, given 
that there is more than one img tag in the crawled urls?
Do I need to parse them and store them in one of the fields of the webpage storage class, 
or is this step not needed?

Thanks.
Alex.
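
The workaround I am considering is to join all the alt values into one string and store it under a single metadata key, splitting it again at indexing time; a rough Java sketch (the class, method and variable names are made up for illustration, and the exact WebPage/putToMetadata signature may differ between 2.x builds):

import java.util.Arrays;
import java.util.List;

public class ImgAltJoiner {

  // Join multiple img alt values into one tab-separated string so that they
  // fit under a single metadata key of the webpage row.
  public static String join(List<String> altValues) {
    StringBuilder sb = new StringBuilder();
    for (String alt : altValues) {
      if (sb.length() > 0) sb.append('\t');
      sb.append(alt);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<String> alts = Arrays.asList("alt1", "alt2");
    // Prints the joined value; in the parse plugin this would be stored once
    // under the "img_alt" key and split on tab again in the indexing filter.
    System.out.println(join(alts));
  }
}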



-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jul 3, 2012 5:08 am
Subject: Re: parse and solrindex in nutch-2.0


Hi,

On Mon, Jul 2, 2012 at 8:21 PM,  alx...@aim.com wrote:

 Regarding the metadata, what would be a proper way of parsing end indexing 
multivalued tags in nutch-2.0 then?


Assuming you've taken a look into the schema, 'some' mutivalued fields
are permitted out of the box. Are you having problems obtaining
multiple values for some fields within the documents your trying to
parse + index?

Lewis

 


Re: updatedb in nutch-2.0 with mysql

2012-07-25 Thread alxsss
Not sure if I understood correctly. 
I did 
Counters c = currentJob.getCounters();
System.out.println(c.toString());

With Mysql
 
DbUpdaterJob: starting
Counters: 20
DbUpdaterJob: starting
counter name=Counters: 20
FileSystemCounters
   FILE_BYTES_READ=878298
   FILE_BYTES_WRITTEN=992362
Map-Reduce Framework
   Combine input records=0
   Combine output records=0
   Total committed heap usage (bytes)=260177920
   CPU time spent (ms)=0
   Map input records=1
   Map output bytes=193
   Map output materialized bytes=202
   Map output records=1
   Physical memory (bytes) snapshot=0
   Reduce input groups=1
   Reduce input records=1
   Reduce output records=1
   Reduce shuffle bytes=0
   Spilled Records=2
   SPLIT_RAW_BYTES=962
   Virtual memory (bytes) snapshot=0
File Input Format Counters
   Bytes Read=0
File Output Format Counters
   Bytes Written=0
DbUpdaterJob: done


Thanks.
Alex.



-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Wed, Jul 25, 2012 12:13 am
Subject: Re: updatedb in nutch-2.0 with mysql


Could you post the job counters?

On Tue, Jul 24, 2012 at 8:14 PM, alx...@aim.com wrote:






 Hello,



 I am testing nutch-2.0 with mysql storage with 1 url. I see that updatedb
 command does not do anything. It does not add outlinks to the table as new
 urls and I do not see any error messages in hadoop.log Here is the log
 entries without plugin load info

  INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
 2012-07-24 10:53:46,142 WARN  util.NativeCodeLoader - Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable
 2012-07-24 10:53:46,979 INFO  mapreduce.GoraRecordReader -
 gora.buffer.read.limit = 1
 2012-07-24 10:53:49,801 INFO  mapreduce.GoraRecordWriter -
 gora.buffer.write.limit = 1
 2012-07-24 10:53:49,806 INFO  crawl.FetchScheduleFactory - Using
 FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
 2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
 defaultInterval=2592
 2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
 maxInterval=2592
 2012-07-24 10:53:52,741 WARN  mapred.FileOutputCommitter - Output path is
 null in cleanup
 2012-07-24 10:53:53,584 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done

 Also, I noticed that there is crawlId option to it. Where its value comes
 from?

 Btw, updatedb with no arguments works fine if Hbase is chosen for storage.

 Thanks.
 Alex.





 ~






 


Re: updatedb in nutch-2.0 with mysql

2012-07-26 Thread alxsss
I queried the webpage table and there are a few links in the outlinks column. As I 
noted in the original message, updatedb works with Hbase. This is the counters 
output in the case of Hbase.

 bin/nutch updatedb
DbUpdaterJob: starting
counter name=Counters: 20
FileSystemCounters
FILE_BYTES_READ=879085
FILE_BYTES_WRITTEN=993668
Map-Reduce Framework
Combine input records=0
Combine output records=0
Total committed heap usage (bytes)=341442560
CPU time spent (ms)=0
Map input records=1
Map output bytes=1421
Map output materialized bytes=1457
Map output records=14
Physical memory (bytes) snapshot=0
Reduce input groups=13
Reduce input records=14
Reduce output records=13
Reduce shuffle bytes=0
Spilled Records=28
SPLIT_RAW_BYTES=701
Virtual memory (bytes) snapshot=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
DbUpdaterJob: done
 
I tried crawling http://www.yahoo.com . The same issue is present.

Thanks.
Alex.



-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Thu, Jul 26, 2012 6:26 am
Subject: Re: updatedb in nutch-2.0 with mysql


Yep I meant those counters.

Looking at the code it seems just 1 record is passed around from mapper to
reducer:This can only mean that no outlinks are outputted in the mapper.
This might indicate that the url is not succesfully parsed. (Did you parse
at all?)

Are you able to peek in (or dump) your database with an external tool to
see if outlinks are present before running the updater? Or perhaps check
some parser log?

On Wed, Jul 25, 2012 at 10:02 PM, alx...@aim.com wrote:

 Not sure if I understood correctly.
 I did
 Counters c currentJob.getCounters();
 System.out.println(c.toString());

 With Mysql

 DbUpdaterJob: starting
 Counters: 20
 DbUpdaterJob: starting
 counter name=Counters: 20
 FileSystemCounters
FILE_BYTES_READ=878298
FILE_BYTES_WRITTEN=992362
 Map-Reduce Framework
Combine input records=0
Combine output records=0
Total committed heap usage (bytes)=260177920
CPU time spent (ms)=0
Map input records=1
Map output bytes=193
Map output materialized bytes=202
Map output records=1
Physical memory (bytes) snapshot=0
Reduce input groups=1
Reduce input records=1
Reduce output records=1
Reduce shuffle bytes=0
Spilled Records=2
SPLIT_RAW_BYTES=962
Virtual memory (bytes) snapshot=0
 File Input Format Counters
Bytes Read=0
 File Output Format Counters
Bytes Written=0
 DbUpdaterJob: done


 Thanks.
 Alex.



 -Original Message-
 From: Ferdy Galema ferdy.gal...@kalooga.com
 To: user user@nutch.apache.org
 Sent: Wed, Jul 25, 2012 12:13 am
 Subject: Re: updatedb in nutch-2.0 with mysql


 Could you post the job counters?

 On Tue, Jul 24, 2012 at 8:14 PM, alx...@aim.com wrote:

 
 
 
 
 
  Hello,
 
 
 
  I am testing nutch-2.0 with mysql storage with 1 url. I see that updatedb
  command does not do anything. It does not add outlinks to the table as
 new
  urls and I do not see any error messages in hadoop.log Here is the log
  entries without plugin load info
 
   INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
  2012-07-24 10:53:46,142 WARN  util.NativeCodeLoader - Unable to load
  native-hadoop library for your platform... using builtin-java classes
 where
  applicable
  2012-07-24 10:53:46,979 INFO  mapreduce.GoraRecordReader -
  gora.buffer.read.limit = 1
  2012-07-24 10:53:49,801 INFO  mapreduce.GoraRecordWriter -
  gora.buffer.write.limit = 1
  2012-07-24 10:53:49,806 INFO  crawl.FetchScheduleFactory - Using
  FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
  2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
  defaultInterval=2592
  2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
  maxInterval=2592
  2012-07-24 10:53:52,741 WARN  mapred.FileOutputCommitter - Output path is
  null in cleanup
  2012-07-24 10:53:53,584 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done
 
  Also, I noticed that there is crawlId option to it. Where its value comes
  from?
 
  Btw, updatedb with no arguments works fine if Hbase is chosen for
 storage.
 
  Thanks.
  Alex.
 
 
 
 
 
  ~
 
 
 
 
 




 


Re: updatedb in nutch-2.0 with mysql

2012-07-27 Thread alxsss
I tried your suggestion with the sql server and everything works fine. 
The issue that I had was with mysql, though: 
mysql  Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1

After I restarted the mysql server and added the mysql root user to gora.properties, 
updatedb adds outlinks as new urls, but I noticed it did not remove the values of 
prsmrk, gnmrk and ftcmrk as happens with Hbase and as follows from the code  
Mark.GENERATE_MARK.removeMarkIfExist(page);... in DbUpdateReducer.java

I also see from time to time an error that a text field has a size less than 
expected.

It seems to me that nutch with mysql is still buggy, so I gave up using mysql 
with it in favor of Hbase.

Thanks for your help.
Alex.




-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Fri, Jul 27, 2012 2:03 am
Subject: Re: updatedb in nutch-2.0 with mysql


I've just ran a crawl with Nutch 2.0 tag using the SqlStore. Please try to
reproduce from a clean checkout/download.

nano conf/nutch-site.xml #set http.agent.name and http.robots.agents
properties
ant clean runtime
java -cp runtime/local/lib/hsqldb-2.2.8.jar org.hsqldb.Server -database.0
mem:0 -dbname.0 nutchtest #start sql server

#open another terminal

cd runtime/local
bin/nutch inject ~/urlfolderWithOneUrl/
bin/nutch generate
bin/nutch fetch batchIdFromGenerate
bin/nutch parse batchIdFromGenerate
bin/nutch updatedb
bin/nutch readdb -stats #this will show multiple entries
bin/nutch readdb -dump out #this will dump a readable text file in folder
out/ (with multiple entries)

If this works as expected, it might be something with your sql server?
(What server are you running exactly?)

Ferdy.

On Thu, Jul 26, 2012 at 8:15 PM, alx...@aim.com wrote:

 I queried webpage table and there are a few links  in outlinks column. As
 I noted in the original letter updatedb works with Hbase. This is  the
 counters output in the case of Hbase.

  bin/nutch updatedb
 DbUpdaterJob: starting
 counter name=Counters: 20
 FileSystemCounters
 FILE_BYTES_READ=879085
 FILE_BYTES_WRITTEN=993668
 Map-Reduce Framework
 Combine input records=0
 Combine output records=0
 Total committed heap usage (bytes)=341442560
 CPU time spent (ms)=0
 Map input records=1
 Map output bytes=1421
 Map output materialized bytes=1457
 Map output records=14
 Physical memory (bytes) snapshot=0
 Reduce input groups=13
 Reduce input records=14
 Reduce output records=13
 Reduce shuffle bytes=0
 Spilled Records=28
 SPLIT_RAW_BYTES=701
 Virtual memory (bytes) snapshot=0
 File Input Format Counters
 Bytes Read=0
 File Output Format Counters
 Bytes Written=0
 DbUpdaterJob: done

 I tried crawling http://www.yahoo.com . The same issue is present.

 Thanks.
 Alex.



 -Original Message-
 From: Ferdy Galema ferdy.gal...@kalooga.com
 To: user user@nutch.apache.org
 Sent: Thu, Jul 26, 2012 6:26 am
 Subject: Re: updatedb in nutch-2.0 with mysql


 Yep I meant those counters.

 Looking at the code it seems just 1 record is passed around from mapper to
 reducer:This can only mean that no outlinks are outputted in the mapper.
 This might indicate that the url is not succesfully parsed. (Did you parse
 at all?)

 Are you able to peek in (or dump) your database with an external tool to
 see if outlinks are present before running the updater? Or perhaps check
 some parser log?

 On Wed, Jul 25, 2012 at 10:02 PM, alx...@aim.com wrote:

  Not sure if I understood correctly.
  I did
  Counters c currentJob.getCounters();
  System.out.println(c.toString());
 
  With Mysql
 
  DbUpdaterJob: starting
  Counters: 20
  DbUpdaterJob: starting
  counter name=Counters: 20
  FileSystemCounters
 FILE_BYTES_READ=878298
 FILE_BYTES_WRITTEN=992362
  Map-Reduce Framework
 Combine input records=0
 Combine output records=0
 Total committed heap usage (bytes)=260177920
 CPU time spent (ms)=0
 Map input records=1
 Map output bytes=193
 Map output materialized bytes=202
 Map output records=1
 Physical memory (bytes) snapshot=0
 Reduce input groups=1
 Reduce input records=1
 Reduce output records=1
 Reduce shuffle bytes=0
 Spilled Records=2
 SPLIT_RAW_BYTES=962
 Virtual memory (bytes) snapshot=0
  File Input Format Counters
 Bytes Read=0
  File Output Format Counters
 Bytes Written=0
  DbUpdaterJob: done
 
 
  Thanks.
  Alex.
 
 
 
  -Original Message-
  From: Ferdy Galema 

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread alxsss
Which storage do you use? Try solrindex with the -reindex option.



-Original Message-
From: X3C TECH t...@x3chaos.com
To: user user@nutch.apache.org
Sent: Sun, Jul 29, 2012 10:58 am
Subject: Re: Nutch 2.0  Solr 4.0 Alpha


Forgot to do Specs
VMWare Machine with CentOS 6.3


On Sun, Jul 29, 2012 at 1:53 PM, X3C TECH t...@x3chaos.com wrote:

 Hello,
 Has anyone been successful in hooking up Nutch 2 with Solr4?
 I seem to have my config screwed up somehow. I've added the Nutch fields
 to Solr's example schema and changed the field type from text' to
 text_general
 However when I index, I get the message
 SolrIndexerJob:starting
 SolrIndexerJob:Done
 but nothing has been indexed. Hadoop log shows no errors, neither does
 Solr terminal window. I even tried installing Solr 3.6.1 and copying the
 schema file as is, with no luck, same issue. Does something need to be
 adjusted in Nutch config? I made no adjustment when I built it, so it's
 stock beyond adjustments to hook up Hbase listed in tutorial. Your help is
 highly appreciated, as I'm really boggled by this!!

 Iggy


 


Re: Why won't my crawl ignore these urls?

2012-07-30 Thread alxsss
Why don't you test your regex to see if it really matches the urls you want to 
eliminate? It seems to me that your regex does not eliminate the type of urls 
you specified.

Alex.



-Original Message-
From: Ian Piper ianpi...@tellura.co.uk
To: user user@nutch.apache.org
Sent: Mon, Jul 30, 2012 1:52 pm
Subject: Re: Why won't my crawl ignore these urls?


Hi again,

Regarding disabling filters. I just checked in my nutch-default.xml and 
nutch-site.xml files. There is no reference to crawl.generate in either, which 
seems (http://wiki.apache.org/nutch/bin/nutch_generate) to suggest that urls 
should be filtered.


Ian.
--
On 30 Jul 2012, at 19:06, Markus Jelsma wrote:

 Hi,
 
 Either your regex is wrong, you haven't updated the CrawlDB with the new 
filters and/or you disabled filtering in the Generator.
 
 Cheers
 
 
 
 -Original message-
 From:Ian Piper ianpi...@tellura.co.uk
 Sent: Mon 30-Jul-2012 20:01
 To: user@nutch.apache.org
 Subject: Why won't my crawl ignore these urls?
 
 Hi all,
 
 I have been trying to get to the bottom of this problem for ages and cannot 
resolve it - you're my last hope, Obi-Wan...
 
 I have a job that crawls over a client's site. I want to exclude urls that 
look like this:
 
 http://[clientsite.net]/resources/type.aspx?type=[whatever] 
http://[clientsite.net]/resources/type.aspx?type=[whatever] 
 
 and
 
 http://[clientsite.net]/resources/topic.aspx?topic=[whatever] 
http://[clientsite.net]/resources/topic.aspx?topic=[whatever] 
 
 
 To achieve this I thought I could put this into conf/regex-urlfilter.txt:
 
 [...]
 -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
 -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
 [...]
 
 Yet when I next run the crawl I see things like this:
 
 fetching http://[clientsite.net]/resources/topic.aspx?topic=10 
http://[clientsite.net]/resources/topic.aspx?topic=10 
 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
 [...]
 fetching http://[clientsite.net]/resources/type.aspx?type=2 
http://[clientsite.net]/resources/type.aspx?type=2 
 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
 [...]
 
 and the corresponding pages seem to appear in the final Solr index. So 
clearly they are not being excluded.
 
 Is anyone able to explain what I have missed? Any guidance much appreciated.
 
 Thanks,
 
 
 Ian.
 -- 
 Dr Ian Piper
 Tellura Information Services - the web, document and information people
 Registered in England and Wales: 5076715, VAT Number: 874 2060 29
 http://www.tellura.co.uk/ http://www.tellura.co.uk/ 
 Creator of monickr: http://monickr.com http://monickr.com/ 
 01926 813736 | 07973 156616
 -- 
 
 

-- 
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales: 5076715, VAT Number: 874 2060 29
http://www.tellura.co.uk/
Creator of monickr: http://monickr.com
01926 813736 | 07973 156616
-- 


 


 


Re: Different batch id

2012-07-31 Thread alxsss
Hi,

Most likely you ran the generate command a few times and did not run updatedb. So 
each generate command assigned a different batchId to its own set of urls.

Alex.



-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jul 31, 2012 10:26 am
Subject: Re: Different batch id


Is there a specific place it's located?  I turned on debugging, but I'm not
seeing a batch id.

On Mon, Jul 30, 2012 at 1:14 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Can you stick on debug logging and see what the batch ID's actually are?

 On Mon, Jul 30, 2012 at 6:12 PM, Bai Shen baishen.li...@gmail.com wrote:
  I set up Nutch 2.x with a new instance of HBase.  I ran the following
  commands.
 
  bin/nutch inject urls
  bin/nutch generate -topN 1000
  bin/nutch fetch -all
  bin/nutch parse -all
 
  When looking at the parse log, I'm seeing a bunch of different batch id
  messages.  These are all on urls that I did not inject into the database.
 
  Any ideas what's causing this?
 
  Thanks.



 --
 Lewis


 


updatedb fails to put UPDATEDB_MARK in nutch-2.0

2012-07-31 Thread alxsss



Hello,


I noticed that the updatedb command must remove the gen, parse and fetch marks and put 
the UPDATEDB_MARK mark, according to the code
 Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
if (mark != null) {
  Mark.UPDATEDB_MARK.putMark(page, mark);
}

in DbUpdateReducer.java

However, outputting the markers in Hbase shows that updatedb removes all marks 
except the injector one and does not put the UPDATEDB_MARK.

Thanks.
Alex.





 


Re: Nutch 2 solrindex

2012-08-01 Thread alxsss
This is directly related to the thread I opened yesterday. I think this is 
a bug, since updatedb fails to put the update mark.
I have fixed it by modifying the code. I have a patch, but I am not sure if I can send 
it as an attachment.

Alex.



-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Aug 1, 2012 10:37 am
Subject: Nutch 2 solrindex


I'm trying to crawl using Nutch 2.  However, I can't seem to get it to
index to solr without adding -reindex to the command.  And at that point it
indexes everything I've crawled.  I've tried both -all and the batch id,
but neither one results in anything being indexed to solr.

Any suggestions of what to look at?

Thanks.

 


Re: Nutch 2 solrindex

2012-08-02 Thread alxsss
The current code putting updb_mrk in dbUpdateReducer is as follows

Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
if (mark != null) {
  Mark.UPDATEDB_MARK.putMark(page, mark); 
   }
The mark is always null, regardless of whether there is a PARSE_MARK or not.

This function calls

public Utf8 removeFromMarkers(Utf8 key) {
  if (markers == null) { return null; }
  getStateManager().setDirty(this, 20);
  return markers.remove(key);
}

It seems to me that getStateManager().setDirty(this, 20) removes the marker, and 
that is why the last line returns null.

I tried to follow  getStateManager().setDirty(this, 20)  in the hierarchy of 
classes, but did not find anything useful.

I  have fixed the issue by replacing the above lines with

Utf8 parse_mark = Mark.PARSE_MARK.checkMark(page);
if (parse_mark != null)
{
Mark.UPDATEDB_MARK.putMark(page, parse_mark);
Mark.PARSE_MARK.removeMark(page);
 }

Thanks.
Alex.



-Original Message-

From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Thu, Aug 2, 2012 12:16 am
Subject: Re: Nutch 2 solrindex


Hi,

Do you want to open a Jira and attach the patch over there? Or just explain
what the problem is caused. I'm curious to what this might be.

Thanks,
Ferdy.

On Wed, Aug 1, 2012 at 9:27 PM, alx...@aim.com wrote:

 This is directly related to the thread I have opened yesterday. I think
 this is a bug, since updatedb fails to put update mark.
 I have fixed it by modifying code. I have a patch, but not sure if I can
 send it as an attachment.

 Alex.



 -Original Message-
 From: Bai Shen baishen.li...@gmail.com
 To: user user@nutch.apache.org
 Sent: Wed, Aug 1, 2012 10:37 am
 Subject: Nutch 2 solrindex


 I'm trying to crawl using Nutch 2.  However, I can't seem to get it to
 index to solr without adding -reindex to the command.  And at that point it
 indexes everything I've crawled.  I've tried both -all and the batch id,
 but neither one results in anything being indexed to solr.

 Any suggestions of what to look at?

 Thanks.




 


Re: Different batch id

2012-08-02 Thread alxsss
Hi,

I have found out that what happens after 

bin/nutch generate -topN 1000

is that only 1000 of the urls get marked with gnmrk.

Then 
bin/nutch fetch -all

skips all urls that do not have gnmrk
according to the code 
Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
if (!NutchJob.shouldProcess(mark, batchId)) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
  }
  return;
}

since shouldProcess(mark, batchId) returns false if mark is null.

Then

bin/nutch parse -all
skips all urls that do not have fetch mark
according to the code
Utf8 mark = Mark.FETCH_MARK.checkMark(page);
String unreverseKey = TableUtil.unreverseUrl(key);
if (!NutchJob.shouldProcess(mark, batchId)) {
  LOG.info("Skipping " + unreverseKey + "; different batch id");
  return;
}

This is output to the log as INFO, and those are the messages you see in the log file.

So it seems to me that the -all option to fetch, parse and solrindex does not work 
as expected.

Alex. 
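
For completeness, the cycle I expected to work is sketched below (batchId stands for the id reported by generate):

bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch batchId
bin/nutch parse batchId
bin/nutch updatedb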



-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Aug 2, 2012 5:59 am
Subject: Re: Different batch id


I just tried running this with the actual batch Id instead of using -all,
and I'm still getting similar results.

On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen baishen.li...@gmail.com wrote:

 I set up Nutch 2.x with a new instance of HBase.  I ran the following
 commands.

 bin/nutch inject urls
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all

 When looking at the parse log, I'm seeing a bunch of different batch id
 messages.  These are all on urls that I did not inject into the database.

 Any ideas what's causing this?

 Thanks.


 


Re: Nutch 2 encoding

2012-08-09 Thread alxsss
Hi,

I use hbase-0.92.1 and do not have problems with utf-8 chars. What exactly is 
your problem?

Alex.


-Original Message-
From: Ake Tangkananond iam...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Aug 9, 2012 11:12 am
Subject: Re: Nutch 2 encoding


Hi,

I'm debugging.

I inserted a code to print out the encoding here in HtmlParser:java
function getParse and it printed utf-8. So I think it might be the data
store problem. What else could be the cause? Could you advise what next I
should go for to have my Thai chars stored correctly in HBase? Can I
simply go with the latest version of HBase? (Not sure if it is compatible
with nutch 2.0)


byte[] contentInOctets = page.getContent().array();
  InputSource input = new InputSource(new
ByteArrayInputStream(contentInOctets));

  EncodingDetector detector = new EncodingDetector(conf);
  detector.autoDetectClues(page, true);
  detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
  String encoding = detector.guessEncoding(page, defaultCharEncoding);

  metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
  metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

LOG.info("encoding : " + encoding);
  input.setEncoding(encoding);



Regards,
Ake Tangkananond



On 8/9/12 11:06 PM, Ake Tangkananond iam...@gmail.com wrote:

Hi,

Sorry for late reply. I was trying to figure out myself but seem no luck.

I'm on Hbase with local deploy version 0.90.6, r1295128, the working
version as said in Wiki:
http://wiki.apache.org/nutch/Nutch2Tutorial


Regards,
Ake Tangkananond




On 8/9/12 10:30 PM, Ferdy Galema ferdy.gal...@kalooga.com wrote:

It depends on the datastore and possibly the server? What store are you
using?

On Thu, Aug 9, 2012 at 4:05 PM, Ake Tangkananond iam...@gmail.com
wrote:

 Hi all,

 I just wonder if Nutch 2 is working fine with non english characters in
 your
 deployment? Thai language used to work fine for me in Nutch 1.5 but not
in
 Nutch 2. Did I miss something. Anything I should check.

 Sorry for silly questions, but thank you in advance. ;-)


 Regards,
 Ake Tangkananond








 


Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-11 Thread alxsss
Hello,

I am getting the same error and here is the log

2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at
java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
at
org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:243)
at
org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:161)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:68)
at
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
at
org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:521)

Thanks.
Alex.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp334p4000616.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-11 Thread alxsss

I was able to do jstack just before the program exited. The output is attached.



 

 

 

-Original Message-
From: alxsss alx...@aim.com
To: user user@nutch.apache.org
Sent: Sat, Aug 11, 2012 2:17 pm
Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded


Hello,

I am getting the same error and here is the log

2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:243)
at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:161)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:68)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:521)

Thanks.
Alex.

--
View this message in context:
http://lucene.472066.n3.nabble.com/java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp334p4000616.html
Sent from the Nutch - User mailing list archive at Nabble.com.


 


updatedb error in nutch-2.0

2012-08-12 Thread alxsss


Hello,


I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase

java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

I see this is because of reversing and unreversing urls. What is the idea 
behind this reversal and unreversal in nutch-2.0?

Thanks.
Alex.

 


Re: updatedb error in nutch-2.0

2012-08-13 Thread alxsss
I found out that the key sent to unreverseUrl in DbUpdateMapper.map was ':index.php/http'.

This happened at depth 3, and I checked the seed file; there was no line of the
form http:/index.php

Thanks.
Alex.



-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Mon, Aug 13, 2012 1:53 am
Subject: Re: updatedb error in nutch-2.0


Hi,

In the specific case of Alex, it means that a row name in the database is
malformed. Looking at the stacktrace lines in TableUtil, it looks like an
url is stored without protocol (at least without a :). This is probably
because of redirected urls not correctly being checked for wellformedness.
If you look at line 664 in the FetcherReducer (HEAD) it writes out a new
url directly as a row in the database. I have never experienced this
exception and this might be because I changed some behaviour that makes
sure a redirected url is handled a bit more like a general outlink. I have
created an issue for this that I will update shortly:
https://issues.apache.org/jira/browse/NUTCH-1448

Ferdy.

On Mon, Aug 13, 2012 at 2:52 AM, j.sulli...@thomsonreuters.com wrote:

 The url is stored in a different order (reversed domain
 name:protocol:port and path) from the order normally seen in your web
 browser so that it can be searched more quickly in NoSQL data stores
 like hbase. Nutch has a brief explanation and convenience utility
 methods around this at TableUtil
 (http://nutch.apache.org/apidocs-2.0/org/apache/nutch/util/TableUtil.html)
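
 To make that concrete, here is a small example of what the helpers do (my own illustration; the exact key format is from memory):

import java.net.MalformedURLException;
import org.apache.nutch.util.TableUtil;

public class ReverseUrlDemo {
  public static void main(String[] args) throws MalformedURLException {
    // Host labels are reversed, then the protocol (and port, if any) and the path
    // are appended, so rows for the same domain sort next to each other in the store.
    String key = TableUtil.reverseUrl("http://bar.example.com/to/index.html");
    System.out.println(key);                         // com.example.bar:http/to/index.html
    System.out.println(TableUtil.unreverseUrl(key)); // http://bar.example.com/to/index.html
  }
}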


 -Original Message-
 From: alx...@aim.com [mailto:alx...@aim.com]
 Sent: Monday, August 13, 2012 9:25 AM
 To: user@nutch.apache.org
 Subject: updatedb error in nutch-2.0



 Hello,


 I get the following error when I do bin/nutch updatedb in nutch-2.0 with
 hbase

 java.lang.ArrayIndexOutOfBoundsException: 1
 at
 org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
 at
 org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
 at
 org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

 I see this is because of reversing and unreversing urls. What is the
 idea behind this reversal and unreversal in nutch-2.0?

 Thanks.
 Alex.




 


Re: nutch 2.0 with hbase 0.94.0

2012-08-13 Thread alxsss
did you delete the old hbase jar from the lib dir?

Alex.



-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Mon, Aug 13, 2012 10:16 am
Subject: Re: nutch 2.0 with hbase 0.94.0


Nutch contains no knowledge of which specific version of a backend you
are using. This is however done through the gora-* dependencies
managed by Ivy.

Although this is a pretty convoluted way to do things, the best way to
find this would be to check out Gora trunk [0], upgrade the hbase
dependencies to whatever you need, compile and package the project
then copy the relevant jar's over to your Nutch installation. This way
you could run a standalone (development) hbase server and try running
your Nutch configuration that way...

hth

Lewis

[0] http://svn.apache.org/repos/asf/gora/trunk/

On Mon, Aug 13, 2012 at 6:11 PM, Ryan L. Sun lishe...@gmail.com wrote:
 hi all,

 I'm trying to set up nutch 2.0 with a existing hbase cluster (using
 hbase 0.94.0). Since nutch 2.0 supports an older version (0.90.4) of
 hbase, starting a nutch inject job crashed hbase daemon. Copying hbase
 0.94.0's lib to nutch/runtime/local/lib folder as google search hinted
 doesn't work for me.
 Any suggestions are appreciated. Thanks.

 PS. I couldn't downgrade the existing hbase cluster software version,
 which is out of my hand.



-- 
Lewis

 


updatedb goes over all urls in nutch-2.0

2012-08-17 Thread alxsss
Hi,

I noticed that the updatedb command goes over all urls, even if they were already
updated in the previous generate, fetch, and updatedb stages.
As a result, updatedb takes a long time, depending on the number of rows in the
datastore.
I think this may be redundant; perhaps it should be restricted to the not-yet-updated
urls only.

Thanks.
Alex.


fetcher fails on connection error in nutch-2.0 with hbase

2012-08-19 Thread alxsss
After fetching for about 18 hours fetcher throws this error

java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
at 
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897)
at 
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
at $Proxy6.getClosestRowBefore(Unknown Source)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:947)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:814)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:788)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:818)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:943)
at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:820)
at org.apache.hadoop.hbase.client.HTable.put(HTable.java:795)


 WARN  zookeeper.ClientCnxn - Session 0x1393cf29d5e0003 for server null, 
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2012-08-19 13:26:56,935 WARN  zookeeper.ClientCnxn - Session 0x1393cf29d5e0003 
for server null, unexpected error, closing socket connection and attempting 
reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2012-08-19 13:26:57,075 WARN  zookeeper.RecoverableZooKeeper - Possibly 
transient ZooKeeper exception: 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase

 
repeatedly, and then it fails. I checked that the hbase process was alive.

Any ideas what could cause this issue?

Thanks.
Alex.


speed of fetcher in nutch-2.0

2012-08-23 Thread alxsss
Hello,

I am using nutch-2.0 with hbase-0.92.1. I noticed that at depths 1, 2, and 3 the
fetcher was fetching around 20K urls per hour. At depth 4 it fetches only 8K
urls per hour.
Any ideas what could cause this decrease in speed? I use local mode with 10
threads.

Thanks.
Alex.


 


Re: recrawl a URL?

2012-08-24 Thread alxsss
This will work only for urls that have If-Modified-Since headers, but most urls
do not have this header.

Thanks.
Alex. 
 

 

 

-Original Message-
From: Max Dzyuba max.dzy...@comintelli.com
To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org
Sent: Fri, Aug 24, 2012 9:02 am
Subject: RE: recrawl a URL?


Thanks again! I'll have to test it more then in my 1.5.1.




Best regards,
MaxMarkus Jelsma markus.jel...@openindex.io wrote:Hmm, i had to look it up 
but 
it is supported in 1.5 and 1.5.1:

http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup


-Original message-
 From:Max Dzyuba max.dzy...@comintelli.com
 Sent: Fri 24-Aug-2012 17:35
 To: Markus Jelsma markus.jel...@openindex.io; user@nutch.apache.org
 Subject: RE: recrawl a URL?
 
 Thank you for the reply. Does it mean that it is not supported in latest 
stable release of Nutch?
 
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: den 24 augusti 2012 17:21
 To: user@nutch.apache.org; Max Dzyuba
 Subject: RE: recrawl a URL?
 
 Hi,
 
 Trunk has a feature for this: indexer.skip.notmodified
 
 Cheers 
  
 -Original message-
  From:Max Dzyuba max.dzy...@comintelli.com
  Sent: Fri 24-Aug-2012 17:19
  To: user@nutch.apache.org
  Subject: recrawl a URL?
  
  Hello everyone,
  
   
  
  I run a crawl command every day, but I don't want Nutch to submit an 
  update to Solr if a particular page hasn't changed. How do I achieve 
  that? Right now the value of db.fetch.interval.default doesn't seem to 
  help prevent the crawl since the updates are submitted to Solr as if 
  the page has been changed. I know for sure that the page has not been 
  changed. This happens for every new crawl command.
  
   
  
   
  
  Thanks in advance,
  
  Max
  
  
 
 


 


Re: Nutch 2 solrindex fails with no error

2012-09-17 Thread alxsss

You can use the -reindex option, since the update markers are not set properly in the 2.0
release.

 

 

-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Mon, Sep 17, 2012 10:16 am
Subject: Re: Nutch 2 solrindex fails with no error


The problem appears to be that Nutch is not sending anything to solr.  But
I can't seem to find a reason in nutch as to why this is.

On Sat, Sep 15, 2012 at 7:36 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Solr logs?

 On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li...@gmail.com wrote:
  I have a nutch 2 setup that I got working with solr about a month ago.  I
  had to shelve it for a little while and I've recently come back to it.
 
  Everything seems to be working fine except for the solr indexing.  To my
  knowledge, nothing has changed between then and now, but whenever I go to
  perform a solrindex, nothing gets index.  My hbase, hadoop, and solr logs
  are all devoid of errors.  The only thing I get in the command line is
 the
  following.
 
  SolrIndexerJob: starting
  SolrIndexerJob: done.
 
 
  Any suggestions of where to look to begin troubleshooting this would be
  appreciated.  I'm baffled.
 
  Thanks.



 --
 Lewis


 


updatedb in nutch-2.0 increases fetch time of all pages

2012-09-17 Thread alxsss
Hello,

updatedb in nutch-2.0 increases the fetch time of all pages regardless of whether they
have already been fetched or not.
For example, if updatedb is applied at depth 1 and page A is fetched with a
fetchTime 30 days from now, then as a result of running updatedb at depth 2 the
fetch time of page A will be 60 days from now, and so on.
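
For what it's worth, if the default fetch schedule is applied to a row on every updatedb pass, its arithmetic alone would produce exactly this 30/60/90-day pattern, since it adds one full interval per call. A sketch of DefaultFetchSchedule.setFetchSchedule as I remember it (not an exact copy of the source):

import org.apache.nutch.storage.WebPage;

// Sketch: each call pushes the next fetch out by one interval,
// whether or not the page was actually fetched in this batch.
public class FetchScheduleSketch {
  public void setFetchSchedule(String url, WebPage page,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    page.setFetchTime(fetchTime + page.getFetchInterval() * 1000L);
    page.setModifiedTime(modifiedTime);
  }
}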

Also, I wondered if it is possible to remove pages that do not pass the filters
from the hbase datastore by using updatedb?

Thanks.
Alex.


Re: Building Nutch 2.0

2012-10-01 Thread alxsss
It seems to me that if you run nutch in deploy mode and make changes to the config
files, then you need to rebuild the .job file again, unless you pass a config dir
option to the hadoop command.

Alex.
 

-Original Message-
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
Sent: Mon, Oct 1, 2012 1:22 pm
Subject: Re: Building Nutch 2.0


I have my 1.3 set up in a /proj/nutch/ directory that has the bin,
conf, lib, logs, ..etc.., with NUTCH_HOME pointing there.  I don't
quite see what the difference would be for 2.x as long as NUTCH_HOME
pointed to the right place.

Is there documentation anywhere on how to do a deployment?

-- Chris


On Mon, Oct 1, 2012 at 3:59 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
 Hi Chris,

 On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote:
 OK, I added the port being used by hbase to iptables, and now I'm farther.

 I'm getting:
 12/10/01 19:44:17 ERROR fetcher.FetcherJob: Fetcher: No agents listed
 in 'http.agent.name' property.

 But I do have an entry there, and it matches the first in the
 robots.agents as well.

 This can only mean that you have not recompiled this stuff into the
 runtime/local directory.


 How should I have this laid out?  Should I be running out of the
 'runtime' dir, or is it fine that I've pulled all those files out and
 into a /proj/nutch-2.1/ directory (so there's a bin, conf, lib,
 ..etc.. in there, with NUTCH_HOME pointing to that dir).

 OK so you are running locally. I can't say whether its OK to copy the
 directories and their content elsewhere as I've never done it however
 I would avoid unless absolutely necessary. It terms of the directory
 layout Nutch 2.x is identical to 1.x.

 It really helps if you make explicit which back end you intend to use
 as the config may alter accordingly.

 


nutch-2.0 generate in deploy mode

2012-10-01 Thread alxsss
Hello,

I use nutch-2.0 with hadoop-0.20.2. bin/nutch generate  command takes 87% of 
cpu  in deploy mode versus 18% in local mode.
Any ideas how to fix this issue?

Thanks.
Alex.


Re: Building Nutch 2.0

2012-10-02 Thread alxsss
According to the code in bin/nutch, if you have a .job file in your NUTCH_HOME then it
runs in deploy mode. If there is no .job file then it runs in local mode, so you
do not need to rebuild nutch each time you change the conf files.

Alex.

 

 

 

-Original Message-
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Oct 2, 2012 5:31 am
Subject: Re: Building Nutch 2.0


Well i'm not using the deploy directory (and I can't get the hadoop
to work, so the .job file shouldn't matter).

I just don't see how changing the configurations (like the agent name
string) would warrant rebuilding the project.  I can understand it if
you're switching between the storage mechanism (MySQL db vs HBase)
because it is only including what is necessary (though it would be
better to just have it all there in some cases), but for just a simple
change I don't quite get it.

Lewis -- if every time I change something minor like http.agent.name
in a config file, will I have to rebuild?

-- Chris


On Mon, Oct 1, 2012 at 4:49 PM,  alx...@aim.com wrote:
 It seems to me that if you run nutch in deploy mode and make changes to 
 config 
files then you need to rebuild .job file again unless you specify config_dir 
option in hadoop command.

 Alex.


 -Original Message-
 From: Christopher Gross cogr...@gmail.com
 To: user user@nutch.apache.org
 Sent: Mon, Oct 1, 2012 1:22 pm
 Subject: Re: Building Nutch 2.0


 I have my 1.3 set up in a /proj/nutch/ directory that has the bin,
 conf, lib, logs, ..etc.., with NUTCH_HOME pointing there.  I don't
 quite see what the difference would be for 2.x as long as NUTCH_HOME
 pointed to the right place.

 Is there documentation anywhere on how to do a deployment?

 -- Chris


 On Mon, Oct 1, 2012 at 3:59 PM, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:
 Hi Chris,

 On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote:
 OK, I added the port being used by hbase to iptables, and now I'm farther.

 I'm getting:
 12/10/01 19:44:17 ERROR fetcher.FetcherJob: Fetcher: No agents listed
 in 'http.agent.name' property.

 But I do have an entry there, and it matches the first in the
 robots.agents as well.

 This can only mean that you have not recompiled this stuff into the
 runtime/local directory.


 How should I have this laid out?  Should I be running out of the
 'runtime' dir, or is it fine that I've pulled all those files out and
 into a /proj/nutch-2.1/ directory (so there's a bin, conf, lib,
 ..etc.. in there, with NUTCH_HOME pointing to that dir).

 OK so you are running locally. I can't say whether its OK to copy the
 directories and their content elsewhere as I've never done it however
 I would avoid unless absolutely necessary. It terms of the directory
 layout Nutch 2.x is identical to 1.x.

 It really helps if you make explicit which back end you intend to use
 as the config may alter accordingly.



 


Re: Error parsing html

2012-10-02 Thread alxsss
Can you provide a few lines of log or the url that gives the exception?
 

 

-Original Message-
From: CarinaBambina carina.rei...@yahoo.de
To: user user@nutch.apache.org
Sent: Tue, Oct 2, 2012 2:04 pm
Subject: Re: Error parsing html


Thanks for the reply. I'm now using Nutch 1.5.1, but nothing has changed so
far.

While debugging I came across the runParser method in ParseUtil class in
which the task.get(MAX_PARSE_TIME, TimeUnit.SECONDS); returns null.
Therefore also the ParseResult object is null, which makes the program raise
the ParseException. 

Right now i have no clue what the problem could be. I also tried using all
default configurations, but nothing changed.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4011495.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


Re: Error parsing html

2012-10-09 Thread alxsss
I checked the url you provided with parsechecker and it is parsed correctly.
You can check yourself by doing bin/nutch parsechecker <your url>. In your
implementation, can you check whether the segment dir has the correct permissions?

Alex.
 
 

 

 

-Original Message-
From: CarinaBambina carina.rei...@yahoo.de
To: user user@nutch.apache.org
Sent: Tue, Oct 9, 2012 10:03 am
Subject: Re: Error parsing html


i now also tried using all source files itself instead of the nutch.jar, but
nothing changed.

Is there anyone who has an idea what the reason for this error might be? Or
at least where and what i should look for? Any hint?!

Thanks in advance!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4012755.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


nutch-2.0-fetcher fails in reduce stage

2012-10-15 Thread alxsss
 

 Hello,

I am trying to use nutch-2.0, hadoop-1.0.3, and hbase-0.92.1 in pseudo-distributed mode
with iptables turned off. As soon as map reaches 100%, the fetcher works for a few
minutes and then fails with the error
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
at 
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897)
at 
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
at $Proxy10.getClosestRowBefore(Unknown Source)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:947)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:814)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:788)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:818)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:943)
at 
org.apache.gora.hbase.store.HBaseTableConnection.close(HBaseTableConnection.java:96)
at org.apache.gora.hbase.store.HBaseStore.close(HBaseStore.java:599)
at 
org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

org.apache.gora.util.GoraException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up 
proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to 
master/192.168.1.4:60020 after attempts=1
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
at 
org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:569)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed 
setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to 
master/192.168.1.4:60020 after attempts=1
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:242)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1278)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1235)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1222)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:918)
at 

Re: nutch-2.0-fetcher fails in reduce stage

2012-10-17 Thread alxsss

Hello,

Today, I closely followed all the hbase and hadoop logs. As soon as map reached
100%, reduce was at 33%. Then, when reduce reached 66%, I saw the following error
in hadoop's datanode log:

2012-10-16 22:44:54,634 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:50010, 
storageID=DS-179532189-192.168.1.4-50010-1349640973409, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:268)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
at java.lang.Thread.run(Thread.java:662)

 

 

And hbase's regionserver stopped without any errors. I do not see any errors in
the hbase master and hadoop namenode logs.


@Lewis
I am not sure what you mean about a configuration to run behind a proxy. I closely
followed the hbase configuration at http://hbase.apache.org/book/configuration.html

box1 -- a local fedora linux box with a dynamic ip
box2 -- a dedicated fedora server with a static ip

On box2 the fetcher runs without any errors, but the generated set is 100,000
times smaller than the set on box1.

Thanks in advance.
Alex.



-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Oct 16, 2012 2:40 am
Subject: Re: nutch-2.0-fetcher fails in reduce stage

 

 
 

Hi Alex,

I've seen similar exceptions numerous times [0] when running the Gora
test suite against HBase however this _always_ occurred against an
HBase version other than the officially supported version of HBase
(which is 0.90.4) when behind a local proxy so I am immediately
tempted to speculate that this may be the source of the problem.

On Tue, Oct 16, 2012 at 3:50 AM,  alx...@aim.com wrote:

 at
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

 org.apache.gora.util.GoraException:
 org.apache.hadoop.hbase.client.RetriesExhaustedException:
Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface
to master/192.168.1.4:60020 after attempts=1

The above two slices of the stack would also indicate that this is the case.

 bin/nutch inject works fine. Also, I have a different linux, box. fetcher with
the same config runs fine, but the generated set is much less than in the first
linux box.

I don't really understand this very well it is quite ambiguous. Can
you clearly define between box1 and box2... and which one works and
which one doesn't? Also how are your HBase configurations across these
boxes and how are you running Nutch?

 Any ideas how to fix this issue and what is the benefit running fetcher in
pseudo distributed mode against the local one?

Finally, is your Nutch deployment configured to run behind a proxy? I
know there is no mention of this but maybe there is more to this than
simply disabling iptables! I am not however HBase literate enough to
comment further on what configuration causes this, therefore I've
copied in the user@ gora list as well.

@user@

The original thread for this topic can be found below [1]

[0] http://www.mail-archive.com/dev@gora.apache.org/msg00485.html
[1] http://www.mail-archive.com/user@nutch.apache.org/msg07823.html

hth

Lewis



 

  

 


Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread alxsss
Hello,

I think the problem is with the storage, not nutch itself. It looks like generate
cannot read the status or fetch time (or gets null values) from mysql.
I had a bunch of issues with mysql storage and switched to hbase in the end.

Alex.

 

 

 

-Original Message-
From: Sebastian Nagel wastl.na...@googlemail.com
To: user user@nutch.apache.org
Sent: Thu, Oct 18, 2012 12:08 pm
Subject: Re: Same pages crawled more than once and slow crawling


Hi Luca,

 I'm using Nutch 2.1 on Linux and I'm having similar problem of 
http://goo.gl/nrDLV, my Nutch is
 fetching same pages at each round.
Um... I failed to reproduce the Pierre's problem with
- a simpler configuration
- HBase as back-end (Pierre and Luca both use mysql)

 Then I ran bin/nutch crawl urls -threads 1

 first.htm was fetched 5 times
 second.htm was fetched 4 times
 third.htm was fetched 3 times
But after the 5th cycle the crawler stopped?

 I tried doing each step separately (inject, generate, ...) with the same 
results.
For Pierre this has worked...
Any suggestions?

 Also the whole process take about 2 minutes, am I missing something about 
 some 
delay config or is
 this normal?
Well, Nutch (resp. Hadoop) are designed to process much data. Job management 
has 
some overhead
(and some artificial sleeps): 5 cycles * 4 jobs (generate/fetch/parse/update) = 
20 jobs.
6s per job seems roughly ok, though it could be slightly faster.

Sebastian

On 10/18/2012 05:55 PM, Luca Vasarelli wrote:
 Hello,
 
 I'm using Nutch 2.1 on Linux and I'm having similar problem of 
http://goo.gl/nrDLV, my Nutch is
 fetching same pages at each round.
 
 I've built a simple localhost site, with 3 pages linked each other:
 first.htm - second.htm - third.htm
 
 I did these steps:
 
 - downloaded nutch 2.1 (source)  untarred to ${TEMP_NUTCH}
 - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql backend 
(thanks to [1])
 - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql 
 configuration 
and adding mysql
 properties (thanks to [1])
 - ran ant runtime from ${TEMP_NUTCH}
 - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
 - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name, 
http.robots.agents and changing
 db.ignore.external.links to true and fetcher.server.delay to 0.0
 - created ${NUTCH_HOME}/urls/seed.txt with http://localhost/test/first.htm; 
inside this file
 - created db table as [1]
 
 Then I ran bin/nutch crawl urls -threads 1
 
 first.htm was fetched 5 times
 second.htm was fetched 4 times
 third.htm was fetched 3 times
 
 I tried doing each step separately (inject, generate, ...) with the same 
results.
 
 Also the whole process take about 2 minutes, am I missing something about 
 some 
delay config or is
 this normal?
 
 Some extra info:
 
 - HTML of the pages: http://pastebin.com/dyDPJeZs
 - Hadoop log: http://pastebin.com/rwQQPnkE
 - nutch-site.xml: http://pastebin.com/0WArkvh5
 - Wireshark log: http://pastebin.com/g4Bg17Ls
 - MySQL table: http://pastebin.com/gD2SvGsy
 
 [1] http://nlp.solutions.asia/?p=180


 


Re: Same pages crawled more than once and slow crawling

2012-10-19 Thread alxsss
Hello,

I meant that it could be a gora-mysql problem. In order to test it, you can run
nutch in local mode with Generator debug logging enabled. Put this

log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout

in your conf/log4j.properties

and run the crawl cycle with updatedb. If gora-mysql works properly, then you
must see lines like

shouldFetch rejected '<url>', fetchTime=<fetchTime>, curTime=<curTime>

in the output for those urls that were fetched in the previous cycle. If you do not see them,
then it means gora-mysql has issues.
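
For reference, the check that prints that line lives in GeneratorMapper and, as far as I remember, looks roughly like this (a sketch, not an exact copy of the 2.x source):

import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.crawl.GeneratorJob;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;

// Sketch of the fetch-schedule check done per row during generate.
public class GeneratorCheckSketch {
  static boolean passesSchedule(String reversedUrl, WebPage page,
      FetchSchedule schedule, long curTime) {
    String url = TableUtil.unreverseUrl(reversedUrl);
    if (!schedule.shouldFetch(url, page, curTime)) {
      if (GeneratorJob.LOG.isDebugEnabled()) {
        GeneratorJob.LOG.debug("shouldFetch rejected '" + url
            + "', fetchTime=" + page.getFetchTime() + ", curTime=" + curTime);
      }
      return false;   // the row is left out of the new batch
    }
    return true;
  }
}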

Good luck.
Alex.

 

 

 

-Original Message-
From: Luca Vasarelli luca.vasare...@iit.cnr.it
To: user user@nutch.apache.org
Sent: Fri, Oct 19, 2012 1:01 am
Subject: Re: Same pages crawled more than once and slow crawling


 Hi Luca,

Hi Sebastian, thanks for replying!

 But after the 5th cycle the crawler stopped?

Yes

 For Pierre this has worked...
 Any suggestions?

I can post info for each step, but please tell me which log is more 
important: Haadop log? MySQL table? If this last one, which fields?

Alex says it's a MySQL problem, how can I verify after the generate step 
if he is correct?

 Well, Nutch (resp. Hadoop) are designed to process much data. Job management 
has some overhead
 (and some artificial sleeps): 5 cycles * 4 jobs (generate/fetch/parse/update) 
= 20 jobs.
 6s per job seems roughly ok, though it could be slightly faster.

Yes, this test is not well designed for Nutch, but I thought, as Stefan 
said, about a config or hardcoded delay somewhere in the nutch files I 
can try to reduce, since I will use on a single machine.

Luca


 


Re: Image search engine based on nutch/solr

2012-10-21 Thread alxsss
Hello,

I have also written this kind of plugin, but instead of putting the thumbnail files
in the solr index they are put in a folder; only the filenames are kept in the solr
index.

I wondered: what is the advantage of putting the thumbnail files in the solr index?

Thanks in advance.
Alex.

 

 

-Original Message-
From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
To: user user@nutch.apache.org
Sent: Sun, Oct 21, 2012 7:26 pm
Subject: Re: Image search engine based on nutch/solr


Hi,

As Lewis say before, if you are going to use nutch for image retrieval and 
indexing in solr, you'll need to invest some time writing some tools depending 
on your needs. I've been working on a search engine using nutch for the 
crawling 
process and solr as an indexing server, the typical use, when we start dealing 
with images we became aware that nutch (through the tike project) extract to 
few 
information about the image per se (basically only metadata, gets extracted), 
I think that this is the biggest problem with nutch. One particular requirement 
for me was to show a thumbnail of the image, so I wrote a plugin that generates 
the thumbnail, then encode it using base64 and store it in the solr index. 
Other 
need was to annotate the image with the surrounding text to improve the search, 
I also write a plugin for this.
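
A minimal sketch of that thumbnail step, assuming only the JDK's ImageIO plus commons-codec for the base64 part (an illustration, not the actual plugin code):

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.imageio.ImageIO;
import org.apache.commons.codec.binary.Base64;

public class ThumbnailSketch {
  // Scales raw image bytes (e.g. page.getContent().array()) down to a thumbnail
  // and returns it base64-encoded, ready to be stored in an index field.
  public static String thumbnailBase64(byte[] imageBytes, int w, int h) throws Exception {
    BufferedImage src = ImageIO.read(new ByteArrayInputStream(imageBytes));
    if (src == null) return null;                      // bytes were not a decodable image
    BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = thumb.createGraphics();
    g.drawImage(src, 0, 0, w, h, null);                // simple scale, no aspect-ratio handling
    g.dispose();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ImageIO.write(thumb, "jpg", out);
    return new String(Base64.encodeBase64(out.toByteArray()), "US-ASCII");
  }
}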

Summarizing, nutch it's a very good start point, but depending on your 
particular needs you'll have to write some plugins on your own.

Greetings

On Oct 20, 2012, at 10:02 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com 
wrote:

 Hi,
 
 On Fri, Oct 19, 2012 at 10:48 PM, Santosh Mahto
 santosh.inb...@gmail.com wrote:
 Hi all
 
 I have few question:
 1. Does nutch support images crawling and indexing(or how much support is
 there)
 
 Depending on how you wish to process and then present your images e.g.
 as thumbnails for example, I would say you need to invest some time
 writing a custom parser for images. You can read a pretty thorough and
 comprehensive thread [0] on this topic.
 
 2. As I got some link where apache-tika plugin is used to make image search
 engine, with little exploration i found
   tikka is defaulted in nutch(as I think ,not sure) . so is image seaching
 also happens by default.
 
 Image processing and indexing is not enabled my default in the above context
 
 3. As I think i also need to configure solr to show the image result .
 could you guide me what extra configuration need to be set in solr side
 
 Unless someone here who has worked with image indexing in Solr can
 help you in a more verbose manner than me, I would certainly direct
 you to thee solr-user@ list archives [1]. There appears to be plenty
 there.
 
 hth
 
 Lewis
 
 [0] http://www.mail-archive.com/user@nutch.apache.org/msg06758.html
 [1] http://www.mail-archive.com/search?q=imagel=solr-user%40lucene.apache.org
 
 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
INFORMATICAS...
 CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
 
 http://www.uci.cu
 http://www.facebook.com/universidad.uci
 http://www.flickr.com/photos/universidad_uci


10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
INFORMATICAS...
CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

 


Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails

2012-11-01 Thread alxsss
Hi,

I think that in order to be sure this is a gora-sql problem, you need to do the
same crawl with nutch/hbase. It should not take much time if you run it in
local mode. Simply install hbase and follow the quick start tutorial.

Alex.

 

 

 

-Original Message-
From: kiran chitturi chitturikira...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Nov 1, 2012 9:29 am
Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails


Hi,

I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487).

Do you think this is because of the SQL backend ? Its failing for PDF files
but working for HTML files.

Can the problem be due to some bug in the tika.parser code (since tika
plugin handles the PDF parsing) ?

I am interesting in fixing this problem, if i can find out where the issue
starts.

Does anyone have inputs for this ?

Thanks,
Kiran.



On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi

 Yes please do open an issue. The docs should be parsed in one go and I
 suspect (yet another) issue with the SQL backend

 Thanks

 J

 On 1 November 2012 13:48, kiran chitturi chitturikira...@gmail.com
 wrote:

  Thank you alxsss for the suggestion. It displays the actualSize and
  inHeaderSize for every file and two more lines in logs but it did not
 much
  information even when i set parserJob to Debug.
 
  I had the same problem when i re-compiled everything today. I have to run
  the parse command multiple times to get all the files parsed.
 
  I am using SQL with GORA. Its mysql database.
 
  For now, atleast the files are getting parsed, do  i need to open a issue
  for this ?
 
  Thank you,
 
  Regards,
  Kiran.
 
 
  On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche 
  lists.digitalpeb...@gmail.com wrote:
 
   Hi Kiran
  
   Interesting. Which backend are you using with GORA? The SQL one? Could
  be a
   problem at that level
  
   Julien
  
   On 31 October 2012 17:01, kiran chitturi chitturikira...@gmail.com
   wrote:
  
Hi Julien,
   
I have just noticed something when running the parse.
   
First when i ran the parse command 'sh bin/nutch parse
1351188762-1772522488', the parsing of all the PDF files has failed.
   
When i ran the command again one pdf file got parsed. Next time,
  another
pdf file got parsed.
   
When i ran the parse command the number of times the total number of
  pdf
files, all the pdf files got parsed.
   
In my case,  i ran it 17 times and all the pdf files are parsed.
 Before
that, not everything is parsed.
   
This sounds strange, do you think it is some configuration problem ?
   
I have tried this 2 times and same thing happened two times for me .
   
I am not sure why this is happening.
   
Thanks for your help.
   
Regards,
Kiran.
   
   
On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:
   
 Hi


  Sorry about that. I did not notice the parsecodes are actually
  nutch
and
  not tika.
 
  no problems!


  The setup is local on Mac desktop and i am using through command
  line
and
  remote debugging through eclipse (
 

   
  
 
 http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
  ).
 

 OK

 
  I have set both http.content.limit and file.content.limit to -1.
  The
logs
  just say 'WARN  parse.ParseUtil - Unable to successfully parse
   content
  http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of
   type
  application/pdf'.
 

 you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml right?
  (not
 in $NUTCH_HOME/conf/nutch-site.xml unless you call 'ant clean
  runtime')


 
  All the html's are getting parsed and when i crawl this page (
  http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the html's
  and
 some
  of the pdf files get parsed. Like, half of the pdf files get
 parsed
   and
 the
  other half don't get parsed.
 

 do the ones that are not parsed have something in common? length?


  I am not sure about what causing the problem as you said
  parsechecker
is
  actually work. I want the parser to crawl the full-text of the
 pdf
   and
 the
  metadata, title.
 

 OK


 
  The metatags are also getting crawled for failed pdf parsing.
 

 They would be discarded because of the failure even if they
 were successfully extracted indeed. The current mechanism does not
   cater
 for semi-failures

 J.

 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble

   
   
   
--
Kiran Chitturi
   
  
  
  
   --
   *
   *Open Source Solutions for Text Engineering

Re: Access crawled content or parsed data of previous crawled url

2012-11-28 Thread alxsss
It is not clear what you are trying to achieve. We have done something similar with
regard to indexing img tags. We retrieve the img tag data while parsing the html
page and keep it in the page metadata, and when parsing the img url itself we create
a thumbnail.
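
In outline it is just a parse filter that walks the DOM for img tags and stores what it finds in the page metadata. A simplified, hypothetical sketch (made-up key name and logic, assuming the 2.x ParseFilter interface -- not our actual plugin):

import java.nio.ByteBuffer;
import java.util.Collection;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

// Hypothetical example: collect the alt text of <img> tags into the page metadata.
public class ImgTagParseFilter implements ParseFilter {
  private Configuration conf;

  public Parse filter(String url, WebPage page, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuilder altText = new StringBuilder();
    NodeWalker walker = new NodeWalker(doc);
    while (walker.hasNext()) {
      Node node = walker.nextNode();
      if ("img".equalsIgnoreCase(node.getNodeName()) && node.getAttributes() != null) {
        Node alt = node.getAttributes().getNamedItem("alt");
        if (alt != null) altText.append(alt.getNodeValue()).append(' ');
      }
    }
    // "img_alt" is a made-up metadata key; an indexing filter would read it back later.
    page.putToMetadata(new Utf8("img_alt"), ByteBuffer.wrap(altText.toString().getBytes()));
    return parse;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
  public Collection<WebPage.Field> getFields() { return null; }
}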

hth.
Alex.

 

 

 

-Original Message-
From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
To: user user@nutch.apache.org
Sent: Wed, Nov 28, 2012 2:58 pm
Subject: Re: Access crawled content or parsed data of previous crawled url


Any documentation about crawldb api? I'm guessing the it shouldn't be so hard 
to 
retrieve a documento by it's url (which is basically what I need. I'm also open 
to any suggestion on this matter, so If any one has done something similar or 
has any thoughts on this and can share it, I'll be very grateful.

Greetings!

- Mensaje original -
De: Stefan Scheffler sscheff...@avantgarde-labs.de
Para: user@nutch.apache.org
Enviados: Miércoles, 28 de Noviembre 2012 15:04:44
Asunto: Re: Access crawled content or parsed data of previous crawled url

Hi,
I think, this is possible, because you can write a ParserPlugin which
access the allready stored documents via the segments- /crawldb api.
But i´m not sure how it will work exactly.

Regards
Stefan

Re
Am 28.11.2012 20:59, schrieb Jorge Luis Betancourt Gonzalez:
 Hi:

 For what I've seen in nutch plugins exist the philosophy of one NutchDocument 
per url, but I was wondering if there is any way of accessing parsed/crawled 
content of a previous fetched/parsed url, let's say for instance that I've a 
HTML page with an image embedded: So the start point will be 
http://host.com/test.html which is the first document that get's fetched/parsed 
then the OutLink extractor will detect the embedded image inside test.html and 
then add the url in the src attribute of the img tag, so then the image url 
will be fetched and then parsed. My question: Is possible, when the image is 
getting parsed, to access the content and parsed data of test.html? I'm trying 
to add some data present on the HTML page as a new metadata field of the image, 
and I'm not quite sure on how to accomplish this.

 Greetings in advance!
 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
INFORMATICAS...
 CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

 http://www.uci.cu
 http://www.facebook.com/universidad.uci
 http://www.flickr.com/photos/universidad_uci


--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de



10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
INFORMATICAS...
CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci


 


Re: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread alxsss
Hi,

Unfortunately, my employer does not want me to disclose details of the plugin 
at this time.

Alex.

 

 

 

-Original Message-
From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
To: user user@nutch.apache.org
Sent: Wed, Nov 28, 2012 6:20 pm
Subject: Re: Access crawled content or parsed data of previous crawled url


Hi Alex:

What you've done is basically what I'm try to accomplish: I'm trying to get the 
text surrounding the img tags to improve the image search engine we're building 
(this is done when the html page containing the img tag is parsed), and when 
the 
image url itself is parsed we generate thumbnails and extract some metadata. 
But 
how do you keep the this 2 pieces of data linked together inside your index 
(solr in my case). Because the thing is that I'm getting two documents inside 
solr (1. containing the text surrounding the img tag, and other document with 
the thumbnail). So what brings me troubles is how when the thumbnail is being 
generated can I get the surrounding text detecte when the html was parsed?

Thanks a lot for all the replies!

P.S: Alex, can you share some piece of code (if it's possible) of your working 
plugins? Or walk me through what you've came up with?

- Mensaje original -
De: alx...@aim.com
Para: user@nutch.apache.org
Enviados: Miércoles, 28 de Noviembre 2012 19:54:07
Asunto: Re: Access crawled content or parsed data of previous crawled url

It is not clear what you try to achieve. We have done something similar in 
regard of indexing img tags. We retrieve img tag data while parsing the html 
page  and keep it in a metadata and when parsing img url itself we create 
thumbnail.

hth.
Alex.







-Original Message-
From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
To: user user@nutch.apache.org
Sent: Wed, Nov 28, 2012 2:58 pm
Subject: Re: Access crawled content or parsed data of previous crawled url


Any documentation about crawldb api? I'm guessing the it shouldn't be so hard to
retrieve a documento by it's url (which is basically what I need. I'm also open
to any suggestion on this matter, so If any one has done something similar or
has any thoughts on this and can share it, I'll be very grateful.

Greetings!

- Mensaje original -
De: Stefan Scheffler sscheff...@avantgarde-labs.de
Para: user@nutch.apache.org
Enviados: Miércoles, 28 de Noviembre 2012 15:04:44
Asunto: Re: Access crawled content or parsed data of previous crawled url

Hi,
I think, this is possible, because you can write a ParserPlugin which
access the allready stored documents via the segments- /crawldb api.
But i´m not sure how it will work exactly.

Regards
Stefan

Re
Am 28.11.2012 20:59, schrieb Jorge Luis Betancourt Gonzalez:
 Hi:

 For what I've seen in nutch plugins exist the philosophy of one NutchDocument
per url, but I was wondering if there is any way of accessing parsed/crawled
content of a previous fetched/parsed url, let's say for instance that I've a
HTML page with an image embedded: So the start point will be
http://host.com/test.html which is the first document that get's fetched/parsed
then the OutLink extractor will detect the embedded image inside test.html and
then add the url in the src attribute of the img tag, so then the image url
will be fetched and then parsed. My question: Is possible, when the image is
getting parsed, to access the content and parsed data of test.html? I'm trying
to add some data present on the HTML page as a new metadata field of the image,
and I'm not quite sure on how to accomplish this.

 Greetings in advance!
 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS
INFORMATICAS...
 CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

 http://www.uci.cu
 http://www.facebook.com/universidad.uci
 http://www.flickr.com/photos/universidad_uci


--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de



10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS
INFORMATICAS...
CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci





10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
INFORMATICAS...
CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci



 


Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-04 Thread alxsss
move or copy that jar file to local/lib and try again.

hth.
Alex.

 

 

 

-Original Message-
From: Arcondo arcondo.dasi...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Jan 4, 2013 2:55 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents


Hope that now you can see them

Plugin folder
http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png 

Parse Job

http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png 

Parse error : Hadoop.log

http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png 

My nutch-site.xm (plugin includes)

property
nameplugin.includes/name
valueprotocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic/value
 descriptionRegular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
 By default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please
 enable
   protocol-httpclient, but be aware of possible intermittent problems
 with the
  underlying commons-httpclient library.
  /description
 /property








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 


Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-04 Thread alxsss
Which version of nutch is this? Did you follow the tutorial? I can help you if
you provide all the steps you did, starting with downloading nutch.

Alex.

 

 

 

-Original Message-
From: Arcondo Dasilva arcondo.dasi...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Jan 4, 2013 1:23 pm
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents


Hi Alex,

I tried. That was the first thing I did but without success.
I don't understand why I'm obliged to use Neko instead of Tika. As far as I
know tika can parse more than 1200 different formats

Kr, Arcondo


On Fri, Jan 4, 2013 at 7:47 PM, alx...@aim.com wrote:

 move or copy that jar file to local/lib and try again.

 hth.
 Alex.







 -Original Message-
 From: Arcondo arcondo.dasi...@gmail.com
 To: user user@nutch.apache.org
 Sent: Fri, Jan 4, 2013 2:55 am
 Subject: Re: Native Hadoop library not loaded and Cannot parse sites
 contents


 Hope that now you can see them

 Plugin folder
 http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png

 Parse Job

 http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png

 Parse error : Hadoop.log

 http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png

 My nutch-site.xm (plugin includes)

 property
 nameplugin.includes/name

 valueprotocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic/value
  descriptionRegular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please
  enable
protocol-httpclient, but be aware of possible intermittent problems
  with the
   underlying commons-httpclient library.
   /description
  /property








 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
 Sent from the Nutch - User mailing list archive at Nabble.com.




 


Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-07 Thread alxsss
Hi,

You can list the jar file's contents (e.g. with jar tf nekohtml-0.9.5.jar) and check whether
the class that parse complains about is inside it. You can also try to put the contents
of the jar file under local/lib. Maybe there is some read restriction. If this does not
help, I can only suggest starting again with a fresh copy of nutch.

Alex.

 

 

 

-Original Message-
From: Arcondo Dasilva arcondo.dasi...@gmail.com
To: user user@nutch.apache.org
Sent: Sat, Jan 5, 2013 1:11 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents


Hi Alex,

I'm using 2.1 version / hbase 0.90.6 / solr 4.0
everything works fine except I'm not able to parse the contents of my url
because of the error Nekohtml not found.

my plugins include looks like this :

valueprotocol-http|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|lib-nekohtml/value

I added  lib-nekohtml at the end of the allowed values but seems that has
no effect on the error.

in my runtime/local/plugins/lib-nekohtml, I have the jar file
: nekohtml-0.9.5.jar

is there something I should look for beside this ?

Thanks a lot for your help.

Kr, Arcondo


On Fri, Jan 4, 2013 at 11:33 PM, alx...@aim.com wrote:

 Which version of nutch  is this? Did you follow the tutorial? I can help
 yuu if you provide all steps you did, starting with downloading nutch.

 Alex.







 -Original Message-
 From: Arcondo Dasilva arcondo.dasi...@gmail.com
 To: user user@nutch.apache.org
 Sent: Fri, Jan 4, 2013 1:23 pm
 Subject: Re: Native Hadoop library not loaded and Cannot parse sites
 contents


 Hi Alex,

 I tried. That was the first thing I did, but without success.
 I don't understand why I'm obliged to use Neko instead of Tika. As far as I
 know, Tika can parse more than 1,200 different formats.

 Kr, Arcondo


 On Fri, Jan 4, 2013 at 7:47 PM, alx...@aim.com wrote:

  move or copy that jar file to local/lib and try again.
 
  hth.
  Alex.
 
 
 
 
 
 
 
  -Original Message-
  From: Arcondo arcondo.dasi...@gmail.com
  To: user user@nutch.apache.org
  Sent: Fri, Jan 4, 2013 2:55 am
  Subject: Re: Native Hadoop library not loaded and Cannot parse sites
  contents
 
 
  Hope that now you can see them
 
  Plugin folder
  http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png
 
  Parse Job
 
  http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png
 
  Parse error : Hadoop.log
 
  http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png
 
  My nutch-site.xm (plugin includes)
 
  <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin.
   By default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
  </property>
 
 
 
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 
 




 


nutch/util/NodeWalker class is not thread safe

2013-01-16 Thread alxsss
Hello,

I use the NodeWalker class (src/java/org/apache/nutch/util/NodeWalker.java)
in one of our plugins. I noticed this comment
//Currently this class is not thread safe. It is assumed that only one
thread will be accessing the <code>NodeWalker</code> at any given time.
above the class definition.

Any ideas if this can cause problems and how to make it thread safe?

Thanks.
Alex.
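
For reference, a minimal sketch of the usual workaround: since NodeWalker keeps mutable traversal state, give each thread (or each call) its own instance rather than sharing one across threads. The hasNext()/nextNode() calls below follow NodeWalker's public API; the surrounding class is purely illustrative.

import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.Node;

// Thread-confined usage: a fresh NodeWalker is created per call, so no two
// threads ever touch the same walker instance.
public final class SafeTextExtractor {

  private SafeTextExtractor() {}

  public static String extractText(Node root) {
    StringBuilder sb = new StringBuilder();
    NodeWalker walker = new NodeWalker(root); // local variable: never shared across threads
    while (walker.hasNext()) {
      Node node = walker.nextNode();
      if (node.getNodeType() == Node.TEXT_NODE) {
        sb.append(node.getNodeValue()).append(' ');
      }
    }
    return sb.toString().trim();
  }
}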


Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread alxsss
I see that inlinks are saved as ol in hbase.

Alex.

 

 

 

-Original Message-
From: kiran chitturi chitturikira...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Jan 30, 2013 9:31 am
Subject: Re: Nutch 2.0 updatedb and gora query


Link to the reference (
http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-database-td4037067.html)
and jira (https://issues.apache.org/jira/browse/NUTCH-1524)


On Wed, Jan 30, 2013 at 12:25 PM, kiran chitturi
chitturikira...@gmail.comwrote:

 Hi,

 I have posted a similar issue on the dev list [0]. The problem is that
 inlinks are not being saved to the database even though they are added to the
 webpage object.

 I am curious about what happens after the fields are saved in the webpage
 object. How are they sent to Gora? Which class is used to communicate with
 Gora?

 I have seen the StorageUtils class, but I want to know if it is the only class
 that is used to communicate with the databases.

 Please let me know your suggestions. I feel the inlinks are not being
 saved due to a small problem in the code.



 [0] -
 http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/browser

 Thanks,
 --
 Kiran Chitturi




-- 
Kiran Chitturi

 


Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread alxsss
What do you call inlinks? For mysite.com, I call inlinks all URLs such as
mysite.com/myhtml1.html, mysite.com/myhtml2.html, etc.
Currently they are saved as 'ol' in HBase. From the HBase shell, run

 get 'webpage', 'com.mysite:http/'

and check what the 'ol' family looks like.

I have this config:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
</property>



Alex.
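
For completeness, the same check can also be done with the plain HBase client API instead of the shell. A rough sketch, assuming the default 'webpage' table and the example row key from above; it uses the HTable-style API of the HBase 0.90.x line discussed in this thread.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Prints the qualifiers stored in the 'ol' (outlinks/anchors) column family for one row.
public class OlFamilyDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webpage");           // default Nutch 2.x table name
    Get get = new Get(Bytes.toBytes("com.mysite:http/")); // example row key from the thread
    Result result = table.get(get);
    Map<byte[], byte[]> ol = result.getFamilyMap(Bytes.toBytes("ol"));
    if (ol != null) {
      for (Map.Entry<byte[], byte[]> e : ol.entrySet()) {
        System.out.println(Bytes.toString(e.getKey()) + " -> " + Bytes.toString(e.getValue()));
      }
    }
    table.close();
  }
}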

 

 

 

-Original Message-
From: kiran chitturi chitturikira...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Jan 30, 2013 11:11 am
Subject: Re: Nutch 2.0 updatedb and gora query


 I have checked the database after the dbupdate job has run, and I could see
only markers, signature and fetch fields.

The initial seed, which was crawled and parsed, has only outlinks. I notice
that one of the outlinks is actually the inlink.

Aren't inlinks supposed to be saved during the dbupdate job? When I tried
to debug, I could see in Eclipse and in the dbUpdateReducer job that the
inlinks are being saved to the page object along with fetch fields and markers,
but I did not understand where the data goes from there.

Is the data written to HBase during the dbUpdateReducer job?

Thanks,
Kiran.




On Wed, Jan 30, 2013 at 1:43 PM, alx...@aim.com wrote:

 I see that inlinks are saved as ol in hbase.

 Alex.







 -Original Message-
 From: kiran chitturi chitturikira...@gmail.com
 To: user user@nutch.apache.org
 Sent: Wed, Jan 30, 2013 9:31 am
 Subject: Re: Nutch 2.0 updatedb and gora query


 Link to the reference (

 http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-database-td4037067.html
 )
 and jira (https://issues.apache.org/jira/browse/NUTCH-1524)


 On Wed, Jan 30, 2013 at 12:25 PM, kiran chitturi
 chitturikira...@gmail.comwrote:

  Hi,
 
  I have posted a similar issue on the dev list [0]. The problem is that
  inlinks are not being saved to the database even though they are added to the
  webpage object.

  I am curious about what happens after the fields are saved in the webpage
  object. How are they sent to Gora? Which class is used to communicate with
  Gora?

  I have seen the StorageUtils class, but I want to know if it is the only class
  that is used to communicate with the databases.

  Please let me know your suggestions. I feel the inlinks are not being
  saved due to a small problem in the code.
 
 
 
  [0] -
  http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/browser
 
  Thanks,
  --
  Kiran Chitturi
 



 --
 Kiran Chitturi





-- 
Kiran Chitturi

 


Re: Nutch 1.6 +solr 4.1.0

2013-02-06 Thread alxsss
Hi,

Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with 
solr-4.1.0.

Alex.

 

 

 

-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 6, 2013 6:13 pm
Subject: Re: Nutch 1.6 +solr 4.1.0


Hi,
We are not good to go with Solr 4.1 yet. There are changes required to
schema.xml, as well as to the indexer package in Nutch, to accommodate API
changes in 4.1.
Please check our Jira for these issues. I am happy to help with the update;
however, it will block some other proposed changes to the pluggable
indexers...


On Wednesday, February 6, 2013, Mustafa Elkhiat melkh...@gmail.com wrote:
 I crawl a website with this command:

 bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 10 -topN 10

 but I ran into the exception below.
 How can I fix this error?

 SolrIndexer: starting at 2013-02-07 03:02:07
 SolrIndexer: deleting gone documents: false
 SolrIndexer: URL filtering: false
 SolrIndexer: URL normalizing: false
 org.apache.solr.client.solrj.SolrServerException:
java.net.ConnectException: Connection refused
 SolrDeleteDuplicates: starting at 2013-02-07 03:02:29
 SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
 Exception in thread "main" java.io.IOException:
org.apache.solr.client.solrj.SolrServerException:
java.net.ConnectException: Connection refused
 at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
 at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
 at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
 at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
 at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
 at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
 at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
 Caused by: org.apache.solr.client.solrj.SolrServerException:
java.net.ConnectException: Connection refused
 at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
 at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
 at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
 at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
 at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
 ... 16 more
 Caused by: java.net.ConnectException: Connection refused
 at java.net.PlainSocketImpl.socketConnect(Native Method)
 at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
 at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
 at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
 at java.net.Socket.connect(Socket.java:579)
 at java.net.Socket.connect(Socket.java:528)
 at java.net.Socket.<init>(Socket.java:425)
 at java.net.Socket.<init>(Socket.java:280)
 at
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
 at
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
 at
org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
 at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
 at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
 at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
 at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
 at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
 ... 20 more



-- 
*Lewis*

 


Re: Nutch 2.1 + HBase cluster settings

2013-02-06 Thread alxsss
Hi,

So you do not run Hadoop, yet the Nutch job works in distributed mode?

Thanks.
Alex.

 

 

 

-Original Message-
From: k4200 k4...@kazu.tv
To: user user@nutch.apache.org
Sent: Wed, Feb 6, 2013 7:43 pm
Subject: Re: Nutch 2.1 + HBase cluster settings


Hi Lewis,

There seems to be a bug in the HBase 0.90.4 library, which comes with
Nutch. I replaced hbase-0.90.4.jar with hbase-0.90.6-cdh3u5.jar and
the problem was resolved.

Regards,
Kaz

2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
 Please let us know how you get on as we can add this to the 2.x errors
 section of the wiki.
 Thanks and good luck with the problem.
 Lewis

 On Wed, Feb 6, 2013 at 4:45 PM, k4200 k4...@kazu.tv wrote:

 Hi Lewis,

 Thanks for your reply.

 2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
  Hi,
 
  On Wednesday, February 6, 2013, k4200 k4...@kazu.tv wrote:
  Q1. My first question is how to fix this issue? Do I need any other
  settings for Nutch to utilize an HBase cluster correctly?
 
  In short, I would personally shoot this over to the HBase lists. As you mention,
  the ZK connections have been increased but you are still experiencing
  similar results. Did you mention which HBase dist you are using?

 Sorry. I should have mentioned this in the previous email.
 I use CDH3 Update 5 on CentOS 6.3, so HBase 0.90.6 with some patches.
 I'll ask the HBase list as well.

  Q2. The second question is about Nutch and Hadoop. I didn't install
  Hadoop Job Tracker and Task Tracker because HBase itself doesn't need
  them according to a SO question [2], but does Nutch need them for some
  types of jobs?
 
  No, running Hadoop in pseudo or distributed mode is not a prerequisite for
  running Nutch successfully, but it can be extremely helpful, not least because
  you get the web app navigation over job control. When Nutch
  is run without a Hadoop JT and TT (e.g. in local mode), it simply relies
  upon the Hadoop library pulled in via Ivy.

 Thanks for the clarification. I'll run JT and TT.

 
  I looked for some documents or diagrams that describe
  the overall architecture of Nutch with Gora and HBase, but couldn't
  find a good one.
 
 
  Mmm. What exactly are you looking for here? We have various articles here
  [0] which explain quite a bit to get you started. Inevitably there is no
  better substitute than looking into the code, and unfortunately we don't
  have any diagrams as such.
  One resource which may be of interest (regarding the Gora API and relevant
  layers) can be found in last year's GSoC project reports [1]. There are some
  Gora architecture class diagrams available there; however, I warn that
  (latterly) they introduce the Gora Web Services API, which was written into
  the current 0.3 development code.

 Thanks for the pointers. And, you're right. I'll look into the code, too.

 Thanks,
 Kaz

 
  hth somewhat though.
  Lewis
 
  [0] http://wiki.apache.org/nutch/#Nutch_2.x
  [1] http://svn.apache.org/repos/asf/gora/committers/reporting/




 --
 *Lewis*

 


Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
Are you saying that your sites have the form siteA.mydomain.com,
siteB.mydomain.com, siteC.mydomain.com?

Alex.

 

 

 

-Original Message-
From: mbehlok m_beh...@hotmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 13, 2013 11:05 am
Subject: Nutch identifier while indexing.


Hello, I am indexing 3 sites:

SiteA
SiteB
SiteC

I want to index these sites in a way that, when searching them in Solr, I can
query each of these sites separately. One could say that's easy, just filter
them by host, but that won't work here: the sites are hosted on the same
host but have different starting points. That is, starting the crawl from
different root URLs (SiteA, SiteB, SiteC) produces different results. My
idea is to somehow specify an identifier in schema.xml that tells
Solr which root URL produced that crawl. Any ideas on how to implement
this? Any variations?

Mitch 
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html
Sent from the Nutch - User mailing list archive at Nabble.com.
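
One way to approach what is asked above is a custom indexing filter that adds a per-site field (say, 'site') derived from the seed each document belongs to, which Solr can then filter on. Only the URL-to-label mapping is sketched below; the exact IndexingFilter interface differs between Nutch 1.x and 2.x, and the seed URLs and the 'site' field name are assumptions. Note this only works if each site's pages stay under a distinct URL prefix; otherwise the seed would have to be carried along as crawl metadata.

import java.util.LinkedHashMap;
import java.util.Map;

// Maps a document URL to the label of the seed ("root URL") whose prefix it matches.
// Inside a real IndexingFilter plugin the result would be added to the NutchDocument,
// e.g. doc.add("site", label), with 'site' declared in the Solr schema.xml.
public class SeedLabeler {

  private final Map<String, String> seedPrefixes = new LinkedHashMap<>();

  public SeedLabeler() {
    // Illustrative seeds only; replace with the real SiteA/SiteB/SiteC root URLs.
    seedPrefixes.put("http://host.example/siteA/", "siteA");
    seedPrefixes.put("http://host.example/siteB/", "siteB");
    seedPrefixes.put("http://host.example/siteC/", "siteC");
  }

  /** Returns the label of the first seed whose prefix matches the URL, or null if none do. */
  public String labelFor(String url) {
    for (Map.Entry<String, String> e : seedPrefixes.entrySet()) {
      if (url.startsWith(e.getKey())) {
        return e.getValue();
      }
    }
    return null;
  }
}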

 


nutch cannot retrive title and inlinks of a domain

2013-02-13 Thread alxsss
Hello,

I noticed that Nutch cannot retrieve the title and inlinks of one of the domains in
the seed list. However, if I run identical code from the server where this
domain is hosted, then it correctly parses it. The surprising thing is that in
both cases this URL has

status: 2 (status_fetched)
parseStatus:success/ok (1/0), args=[]


I used nutch-2.1 with hbase-0.92.1 and nutch 1.4.


Any ideas why this happens?

Thanks.

Alex. 


Re: nutch cannot retrive title and inlinks of a domain

2013-02-13 Thread alxsss
Hi,

I noticed that for the other URLs in the seed, inlinks are saved as 'ol'. I checked
the code and figured out that this is done by the part that saves anchors.
So, in my case inlinks are saved as anchors in the 'ol' field in HBase. But for
one of the URLs, the title and inlinks are not retrieved, although its parse
status is marked success/ok (1/0), args=[].

Alex.

 

 

 

-Original Message-
From: kiran chitturi chitturikira...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 13, 2013 12:40 pm
Subject: Re: nutch cannot retrive title and inlinks of a domain


Hi Alex,

 Inlinks do not work for me currently for the same domain [0]. I am
using Nutch-2.x and HBase. Do the inlinks get saved for you for some of
the crawl seeds?

Surprisingly, the title does not get saved. Did you try using parsechecker?


[0] - http://www.mail-archive.com/user@nutch.apache.org/msg08627.html


On Wed, Feb 13, 2013 at 3:26 PM, alx...@aim.com wrote:

 Hello,

 I noticed that Nutch cannot retrieve the title and inlinks of one of the
 domains in the seed list. However, if I run identical code from the server
 where this domain is hosted, then it correctly parses it. The surprising
 thing is that in both cases this URL has

 status: 2 (status_fetched)
 parseStatus:success/ok (1/0), args=[]


 I used nutch-2.1 with hbase-0.92.1 and nutch 1.4.


 Any ideas why this happens?

 Thanks.

 Alex.




-- 
Kiran Chitturi

 


fields in solrindex-mapping.xml

2013-02-14 Thread alxsss
Hello,

I see that there are 

<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>

fields, in addition to the title, host and content ones, in nutch-2.x's
solrindex-mapping.xml. I thought tstamp may be needed for sorting documents. What
about the other fields,
segment, boost and digest? Can someone explain why these fields are included
in solrindex-mapping.xml?


Thanks.
Alex.



