Hello,
I have built an mp3 parser and put it in C:\nutch\plugins. However, nutch does
not find mp3's. I checked the C:\Tomcat\webapps\ROOT\WEB-INF\classes\plugins dir.
There is no parser-mp3 folder.
Any idea how to fix this?
Thanks.
Alex.
Hi All,
In nutch/conf/nutch-default.xml I have the following property:
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|mp3)|index-(basic|more)|query-(basic|more|site|url)|summary-basic|scoring-opic</value>
</property>
...
However, I have this file: file:///C:/nutch/plugins/parse-mp3/jid3lib-0.5.4.jar
-Original Message-
From: Hasan Diwan [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tue, 11 Dec 2007 6:45 pm
Subject: Re: problem with mp3 parser
Think you may need the jar file in
It did not help. Also, I checked that the search.dir value does not change in
C:\Tomcat\webapps\ROOT\WEB-INF\classes\nutch-default.xml although I changed it
in nutch/conf/nutch-default.xml. Should the size of the nutch*.war file change
depending on how many sites are fetched? Also if I put all
Thanks for your comment. I had all of these except I had
<runtime>
  <library name="parse-mp3.jar">
    <export name="*"/>
  </library>
  <library name="jid3lib-0.5.1.jar"/>
</runtime>
instead of the jid3lib-0.5.4.jar that I actually used. I corrected it, but still did not get the mp3
plugin in
Unfortunately, my computer is not available remotely. What does offlist mean?
thanks.
Alex.
-Original Message-
From: Hasan Diwan [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wed, 12 Dec 2007 1:05 pm
Subject: Re: problem with mp3 parser
On 12/12/2007,
I parsed a few sites with pdf files. Then I added one more site to the urls file.
Now, nutch does not parse pdf's at all.
Any idea what is wrong?
Thanks.
Alex.
Hello,
Do you have enough space? I noticed that nutch downloads the content of those pages
and uses it as a cached version. Try to disable caching.
I fetched a couple of pages and my data file is already about 8MB.
Alex.
-Original Message-
From: Josh Attenberg [EMAIL PROTECTED]
To:
Hi,
Do you recommend something other than nutch?
Thanks.
Alex.
-Original Message-
From: Karol Rybak [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 4 Jan 2008 4:15 am
Subject: Re: Nutch - crashed during a large fetch, how to restart?
Hi there i had
Hello all,
I am using nutch 0.9, and when I fetch a couple of sites nutch does not include
pages other than the main one.
For example, if I have mysite.com/cv.htm, nutch fetches only mysite.com. It
does not fetch cv.htm and the other files on the site.
I noticed that if I do bin/nutch generate
Hi,
In my urls file I have mysite.com, and this site has links to all files, like
cv.htm, mypaper.pdf, etc.
Thanks.
Alex.
-Original Message-
From: Susam Pal [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wed, 9 Jan 2008 8:34 pm
Subject: Re: some crawl problems
Hi,
I did not understand. Instead of jid3lib-0.5.4.jar, which jar file do you
recommend?
A.
-Original Message-
From: Brian Whitman [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 18 Jan 2008 2:40 pm
Subject: Re: Help with parse-mp3?
On Jan 17, 2008,
Unfortunately, I am not familiar with it. Can you give us more info about it?
Thanks.
Alex.
-Original Message-
From: Brian Whitman [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 18 Jan 2008 3:54 pm
Subject: Re: Help with parse-mp3?
On Jan 18, 2008, at
Can you please let me know how to set up nutch to work on 2 or more machines?
Thanks.
Alex.
-Original Message-
From: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; [EMAIL PROTECTED]
Sent: Sun, 20 Jan 2008 9:57 pm
Subject: Crawl taking too much time
hi...
hi im
Hi,
Which article? Do you have a link?
Thanks.
A.
-Original Message-
From: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Mon, 21 Jan 2008 9:34 pm
Subject: RE: Crawl taking too much time
Hi
Did you go through the article in the wiki?
Thanx
kishore
-Original
How does this FreeGenerator work?
Thanks.
Alex.
-Original Message-
From: Barry Haddow [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Thu, 14 Feb 2008 8:31 am
Subject: crawl stops at depth 1
Hi
I'm trying to get a nutch crawl to work, and it keeps stopping at depth
Hi,
Can you specify how those prices get pulled out from different sites?
Thanks.
Alex.
-Original Message-
From: Willson Chan [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wed, 7 May 2008 12:31 am
Subject: How to gather product info from internet with Nutch?
I am interested.
-Original Message-
From: Dennis Kubes [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 28 Nov 2008 2:26 am
Subject: Nutch Training Seminar
Would anybody be interested in a Nutch training seminar that goes over
the following:
1)
Hello,
I am using nutch-0.9 to index files. However, nutch spends less than 1 sec
fetching those files and gives
java.lang.NullPointerException.
As I see from the plugin's code, nutch downloads content to a temp file and then
parses it. So the problem is that nutch does not download the whole
Hello,
I use nutch-0.9 and try to index urls containing ? and similar symbols. I have commented
out this line -[...@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter and
conf/regex-urlfilter.txt files.
However, nutch still ignores these urls.
Does anyone know how this can be fixed?
Thanks in advance.
Hello,
I have one specific domain. I tested further, and it looks like nutch fetches
this domain's other links, but not the ones with ?. Also, nutch fetches other domains
with the ? symbol.
How can I find out whether robots.txt on this domain blocks these specific links from
being fetched?
Thanks.
A.
I was trying to fetch one specific url with the ? symbol, and nutch was refusing to
fetch it. But if I fetch the domain itself, nutch fetches links with the ? symbol also.
Now, I noticed that nutch did not fetch all files on this given domain. But if
I direct nutch to an unfetched file's url it
Hello,
I use nutch-0.9 and need to index about 1? domains. I want to know the
minimum hardware and memory requirements.
Thanks in advance.
Alex.
Hi,
Thanks for the reply. I have a list of those domains only. I am not sure how
many pages they have. Is a DSL connection sufficient to run nutch in my case?
Did you run nutch for all of your pages at once, or separately for a given
subset of them? Btw, yesterday I tried to use the merge shell
Hi,
I will need to index all links in the domains then. Do you think a linux box
like yours with a DSL connection is OK to index the domains I have?
Why only segments? I thought we need to merge all subfolders under the crawl
folder. What did you use for merging them?
Thanks.
A.
Hi,
I also noticed that we can disable storing the content of pages, which I use. I
wonder why someone needs to store content. Also, in the case of files, is there a
way to tell nutch not to download the whole file but, say, 1000 bytes from
the beginning, and parse and index information only in that
I never tried to test this configuration. What about asking nutch to download
a certain amount of bytes from the end of files?
-Original Message-
From: Jasper Kamperman jasper.kamper...@openwaternet.com
To: nutch-user@lucene.apache.org
Sent: Tue, 3 Mar 2009 8:32 pm
Subject: Re:
What if we ask it to download 1000 bytes from the beginning and the same amount
from the end and ignore the rest?
I need this to index mp3 files, since their metadata is either at the beginning or
the end of the file.
My goal is to have nutch not spend time downloading whole files.
Thanks.
A.
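For context on why head plus tail would suffice: ID3v2 tags sit at the start of an mp3 and ID3v1 tags occupy the fixed final 128 bytes, so the metadata is recoverable from the two ends alone. A minimal sketch of reading an ID3v1 trailer (synthetic bytes for illustration, not Nutch's actual parse-mp3 code):

```python
def parse_id3v1(data: bytes):
    """Parse an ID3v1 tag from the last 128 bytes of an mp3, if present."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":          # ID3v1 magic marker
        return None
    return {
        # fixed-width fields, padded with NULs or spaces
        "title": tag[3:33].rstrip(b"\x00 ").decode("latin-1"),
        "artist": tag[33:63].rstrip(b"\x00 ").decode("latin-1"),
    }

# Build a synthetic 128-byte ID3v1 trailer for demonstration.
fake = b"TAG" + b"Halo".ljust(30, b"\x00") + b"Beyonce".ljust(30, b"\x00")
fake = fake.ljust(128, b"\x00")
info = parse_id3v1(b"\xffmp3-audio-bytes" + fake)
print(info)  # -> {'title': 'Halo', 'artist': 'Beyonce'}
```

A crawler taking this approach would issue HTTP Range requests (e.g. the first 1000 bytes and the last 128) instead of downloading the whole file.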
-Original
, 5 Mar 2009 1:24 pm
Subject: Re: what is needed to index for about 1 domains
Hi Alxsss,
How can we disable storing of contents of pages?
Regards,
Mayank.
On Wed, Mar 4, 2009 at 9:57 AM, alx...@aim.com wrote:
Hi,
I also noticed that we can disable storing content of pages
Hello,
I used lukeall-0.9.1.jar to manually add a new record to the index produced
by nutch-0.9. I added only the url and title fields, since I was not sure what
to put in the other fields. Now, for a search of any word I get this error:
HTTP Status 500 -
type Exception report
message
Hi,
I use nutch-0.9. I downloaded the lukeall-0.9.1.jar file from
http://www.getopt.org/luke/ and double-clicked it in windows. That website says:
"It uses the official Lucene 2.4.0 release JARs"
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To:
btw, which version of lucene is in nutch-0.9?
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Fri, 13 Mar 2009 5:14 pm
Subject: Re: error after adding indexes manually
What versions of Lucene are Nutch
I opened the lukeall-0.9.1.jar file, replaced org/apache/lucene with
org/apache/lucene from the lucene-core-2.1.0.jar file, and built a new
lukeall-0.9.2.jar. Now, when I double-click it, it says: Failed to load
Main-Class manifest attribute from lukeall-0.9.2.jar
Thanks.
Alex.
-Original
Comment out this line -[...@=] in crawl-urlfilter.txt.
Alex.
-Original Message-
From: MyD myd.ro...@googlemail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 19 Mar 2009 6:14 am
Subject: Re: Nutch doesn't find all urls.. Any suggestion?
I may have to say that in the
I think you must put
mycity.gov/water in your crawl-urlfilter.txt file.
Alex.
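To make that concrete, a hedged sketch of the relevant crawl-urlfilter.txt lines (hostname taken from the example above; the exact regex idiom may vary across Nutch versions):

```
# accept only urls under the water subweb
+^http://([a-z0-9]*\.)*mycity.gov/water
# skip everything else
-.
```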
-Original Message-
From: Robert Edmiston robert.edmis...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 26 Mar 2009 1:32 pm
Subject: Limiting crawls to subwebs
I am trying to
Hello,
I used lukeall-0.9.1 to manually add a document to the indexes generated by
nutch-1.0. However, the manually added documents do not show up in search.
Thanks for any suggestions.
A.
Thanks for your response. In
luke there is also an option to commit. I opened the new index again, and
the document I created is there. But the search does not return
anything for the added keywords. I will try Solr to see if it works.
alxsss is misleading - there is no commit() operation in Nutch.
Also, the index doesn't have to be optimized. The most likely reason why
the added document is not visible is that Nutch also needs a
corresponding record in the segments/... data. This is not possible to
create separately, you need
Aren't EC2 machines virtual hosts? I had a problem with speed with my virtual
hosts on a local linux box.
What is preferable, a dedicated server or EC2?
-Original Message-
From: Jack Yu jackyu...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 2 Apr 2009 6:54 pm
Subject: Re:
Hello,
I just heard that nutch-1.0 has Solr integration. Are there any tutorials on how
to add data to nutch-1.0 indexes manually using Solr?
Thanks.
Alex.
I went through that page. But when I try to add indexes manually using
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit waitFlush="false" waitSearcher="false"/>'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
The add request is like this:
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<add>
<doc boost="2.5">
<field name="segment">20090512170318</field>
<field name="digest">86937aaee8e748ac3007ed8b66477624</field>
<field name="boost">0.21189615</field>
<field name="url">test.com</field>
Hi,
Is it available on the internet? If not, could you please attach it?
Thanks.
A.
-Original Message-
From: Jake Jacobson jakecjacob...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Mon, Jul 13, 2009 1:26 pm
Subject: Nutch Tutorial 1.0 based off of the French Version
Hi,
As I understand it, only the indexing part of nutch exists in C++, as CLucene. I
want to code nutch in C++, but only if it is worth doing. I wondered if it is
worth coding the remaining parts of nutch in C++, say the crawler. Can
someone give me directions on where to start?
Thanks
Alex.
Hi,
I know nutch uses Lucene. But what is CLucene for, then? Only for indexing files
on a hard drive?
I have knowledge of C++ and some experience. I wanted to code the crawler of Nutch
in C++ to get more experience and make it open source, but only if it will be useful
for the open source
Hi,
The plugin is enabled in the nutch-default.xml file, but changes in it did not
affect search. Instead, changes in crawl-urlfilter.txt do change the fetched
links.
Thanks.
Alex.
-Original Message-
From: Paul Tomblin ptomb...@xcski.com
To: nutch-user@lucene.apache.org
Sent:
Thanks for your comments. Is there anything I could code in C++ that the open
source community would benefit from?
Alex.
-Original Message-
From: Otis Gospodnetic ogjunk-nu...@yahoo.com
To: nutch-user@lucene.apache.org
Sent: Tue, Aug 4, 2009 6:54 am
Subject: Re: Nutch in C++
Hello,
I am trying to paginate results obtained by using the opensearch rss. To do this
I need totalResults from the rss feed, which comes as
<opensearch:totalResults>100</opensearch:totalResults>
However, in PHP's simplexml_load_file results I do not see this part of the
feed. Does someone know how to get
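One common cause is that opensearch:totalResults is namespace-qualified, so it has to be looked up with the OpenSearch namespace URI (in PHP's SimpleXML that means using children() with the namespace). A small illustration in Python with a made-up feed fragment, assuming the standard OpenSearch 1.1 namespace:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal OpenSearch RSS fragment for illustration.
rss = """<rss xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <opensearch:totalResults>100</opensearch:totalResults>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# Namespaced children are only found when the lookup carries the namespace URI.
ns = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}
total = root.find("channel/opensearch:totalResults", ns)
print(total.text)  # -> 100
```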
Hi,
I have read a few tutorials on running Nutch to crawl the web. However, I still
do not understand the meaning of the topN variable in the crawl command. In the
tutorials it is suggested to create 3 segments and fetch them with topN=1000.
What if I create 100 segments, or only one? What would be
Thanks. What if the urls in my seed file do not have outlinks, say .pdf files?
Should I still specify the topN variable? All I need is to index all urls in my
seed file. And there are about 1 M of them.
Alex.
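For context, topN only caps how many of the top-scoring pending urls go into each generated segment; the whole seed list gets covered by repeating the generate/fetch/updatedb cycle. A hedged command sketch (directory names are assumptions, syntax as in nutch-0.9):

```
# one crawl pass: select at most 1000 top-scoring urls into a fresh segment
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/2* | tail -1`   # path of the segment just created
bin/nutch fetch $s                      # fetch its urls
bin/nutch updatedb crawl/crawldb $s     # fold results back into the crawldb
```

With a 1 M-url seed list and topN=1000, on the order of a thousand such passes would be needed just to cover the seeds, which may explain long runtimes.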
-Original Message-
From: Kirby Bohling kirby.bohl...@gmail.com
To:
In the tutorial on the wiki, the depth is not specified and topN=1000. I ran
those commands yesterday and it is still running. Will it index all my urls? My
seed file has about 20K urls.
Thanks.
Alex.
-Original Message-
From: Marko Bauhardt m...@101tec.com
To:
Hi,
After a merge of two segments failed with a no-space-available error, I deleted
all tmp folders. Now any attempt to use merge or crawl says
org.apache.hadoop.util.Shell$ExitCodeException: chmod:
/private/tmp/hadoop-root/mapred/system/job_local_0001: No such file or directory
Is there any
What does local mode mean? I was running the nutch merge command on my MacPro. I
created those folders and set permissions, then I restarted my laptop and it does
not start. When the merge command was working, I noticed that the free available
space was only 1kb. Does this mean that merge destroyed my laptop's
Hello,
I have run the merge script to merge two crawl dirs, one 1.6G and another 120MB.
But my MacPro with 50G of free space did not start after the merge crashed with a
no-space error. I have been told that OSX got corrupted.
I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty. Can
Hello,
I have a crawl folder with 2GB of data, and its index is 160MB. Then nutch
indexed another set of domains, and its crawl folder is about 1MB. I wondered if
there is an effective way of making the indexes from both folders available for
search without using the merge script, since merging large
Hello,
I have indexed a lot of mp3 files using nutch 1.0. Now, for a search from the
command line or tomcat, one keyword gives unrelated records.
For example, for the keyword beyonce, search gives all mp3s that have beyonce in
the id3 tags and a lot of unrelated files that absolutely do not have
Hello,
I am planning to index websites with German and Turkish symbols, like
latin letters with dots over them, etc. Which plugins should I activate?
Also, I wondered how to make nutch behave like the google or yahoo crawlers. I
see the google crawler in our apache logs every other minute. It follows