Hello,
I have built an mp3 parser and put it in C:\nutch\plugins. However, nutch does
not find mp3's. I checked the C:\Tomcat\webapps\ROOT\WEB-INF\classes\plugins dir.
There is no parser-mp3 folder.
Any idea how to fix this?
Thanks.
Alex.
Hi All,
In nutch/conf/nutch-default.xml I have the following property:
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|mp3)|index-(basic|more)|query-(basic|more|site|url)|summary-basic|scoring-opic</value>
</property>
...
However, I have this file: file:///C:/nutch/plugins/parse-mp3/jid3lib-0.5.4.jar
-Original Message-
From: Hasan Diwan [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tue, 11 Dec 2007 6:45 pm
Subject: Re: problem with mp3 parser
Think you may need the jar file in
It did not help. Also, I checked that the search.dir value does not change in
C:\Tomcat\webapps\ROOT\WEB-INF\classes\nutch-default.xml although I changed it
in nutch/conf/nutch-default.xml. Should the size of the nutch*.war file change
depending on how many sites are fetched? Also if I put all
Thanks for your comment. I had all of these except I had
<runtime>
  <library name="parse-mp3.jar">
    <export name="*"/>
  </library>
  <library name="jid3lib-0.5.1.jar"/>
</runtime>
instead of the jid3lib-0.5.4.jar that I actually used. I corrected it, but still did not get the mp3
plugin in
Unfortunately, my computer is not available remotely. What does offlist mean?
thanks.
Alex.
-Original Message-
From: Hasan Diwan [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wed, 12 Dec 2007 1:05 pm
Subject: Re: problem with mp3 parser
On 12/12/2007,
I parsed a few sites with pdf files. Then I added one more site to the urls file.
Now, nutch does not parse pdf's at all.
Any idea what is wrong?
Thanks.
Alex.
Hello,
Do you have enough space? I noticed that nutch downloads the content of those pages
and uses it as a cached version. Try to disable caching.
I fetched a couple of pages and my data file is already about 8MB.
Alex.
-Original Message-
From: Josh Attenberg [EMAIL PROTECTED]
To:
Hi,
Do you recommend something other than nutch?
Thanks.
Alex.
-Original Message-
From: Karol Rybak [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 4 Jan 2008 4:15 am
Subject: Re: Nutch - crashed during a large fetch, how to restart?
Hi there i had
Hello all,
I am using nutch 0.9, and when I fetch a couple of sites nutch does not include
pages other than the main one.
For example, if I have mysite.com/cv.htm, nutch fetches only mysite.com. It
does not fetch cv.htm and the other files on the site.
I noticed that if I do bin/nutch generate
Hi,
In my urls file I have mysite.com, and this site has links to all files, like
cv.htm, mypaper.pdf, etc.
Thanks.
Alex.
-Original Message-
From: Susam Pal [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wed, 9 Jan 2008 8:34 pm
Subject: Re: some crawl problems
Hi,
I did not understand. Instead of jid3lib-0.5.4.jar, which jar file do you
recommend?
A.
-Original Message-
From: Brian Whitman [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 18 Jan 2008 2:40 pm
Subject: Re: Help with parse-mp3?
On Jan 17, 2008,
Unfortunately, I am not familiar with it. Can you give us more info about it?
Thanks.
Alex.
-Original Message-
From: Brian Whitman [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 18 Jan 2008 3:54 pm
Subject: Re: Help with parse-mp3?
On Jan 18, 2008, at
Can you please let me know how to set up nutch to work on 2 or more machines?
Thanks.
Alex.
-Original Message-
From: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; [EMAIL PROTECTED]
Sent: Sun, 20 Jan 2008 9:57 pm
Subject: Crawl taking too much time
hi...
hi im
Hi,
Which article? Do you have a link?
Thanks.
A.
-Original Message-
From: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Mon, 21 Jan 2008 9:34 pm
Subject: RE: Crawl taking too much time
Hi
Did you go through the article in the wiki?
Thanx
kishore
-Original
How does this FreeGenerator work?
Thanks.
Alex.
-Original Message-
From: Barry Haddow [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Thu, 14 Feb 2008 8:31 am
Subject: crawl stops at depth 1
Hi
I'm trying to get a nutch crawl to work, and it keeps stopping at depth
Hi,
Can you specify how those prices get pulled out from different sites?
Thanks.
Alex.
-Original Message-
From: Willson Chan [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wed, 7 May 2008 12:31 am
Subject: How to gather product info from internet with Nutch?
I am interested.
-Original Message-
From: Dennis Kubes [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Fri, 28 Nov 2008 2:26 am
Subject: Nutch Training Seminar
Would anybody be interested in a Nutch training seminar that goes over
the following:
1)
Hello,
I am using nutch-0.9 to index files. However, nutch spends less than 1 sec
fetching those files and gives
java.lang.NullPointerException.
As I see from the plugin's code, nutch downloads content to a temp file and then
parses it. So the problem is that nutch does not download the whole
Hello,
I use nutch-0.9 and try to index urls containing ? and similar symbols. I have commented
out this line -[...@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter and
conf/regex-urlfilter.txt files.
However, nutch still ignores these urls.
Does anyone know how this can be fixed?
Thanks in advance.
Hello,
I have one specific domain. I tested further, and it looks like nutch fetches
this domain's other links, but not the ones with ?. Also, nutch fetches other domains
with the ? symbol.
How can I find out whether robots.txt on this domain blocks these specific links from
being fetched?
Thanks.
A.
I was trying to fetch one specific url with the ? symbol, and nutch was refusing to
fetch it. But if I fetch the domain itself, nutch fetches links with the ? symbol also.
Now, I noticed that nutch did not fetch all files on this given domain. But if
I direct nutch to an unfetched file's url it
Hello,
I use nutch-0.9 and need to index about 1? domains. I want to know the
minimum hardware and memory requirements.
Thanks in advance.
Alex.
Hi,
Thanks for the reply. I have a list of those domains only. I am not sure how
many pages they have. Is a DSL connection sufficient to run nutch in my case?
Did you run nutch for all of your pages at once, or separately for a given
subset of them? Btw, yesterday I tried to use the merge shell
Hi,
I will need to index all links in the domains then. Do you think a linux box
like yours with a DSL connection is OK to index the domains I have?
Why only segments? I thought we need to merge all subfolders under the crawl
folder. What did you use for merging them?
Thanks.
A.
Hi,
I also noticed that we can disable storing the content of pages, which I use. I
wonder why someone needs to store content. Also, in the case of files, is there a
way to tell nutch not to download the whole file but, say, 1000 bytes from
the beginning, and parse and index information only in that
I never tried to test this configuration. What about asking nutch to download
a certain amount of bytes from the end of files?
-Original Message-
From: Jasper Kamperman jasper.kamper...@openwaternet.com
To: nutch-user@lucene.apache.org
Sent: Tue, 3 Mar 2009 8:32 pm
Subject: Re:
What if we ask it to download 1000 bytes from the beginning and the same amount
from the end and ignore the rest?
I need this to index mp3 files, since their metadata is either at the beginning or
the end of the file.
My goal is to have nutch not spend time downloading whole files.
Thanks.
A.
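For context on why head plus tail would suffice: ID3v2 tags sit at the start of an mp3 and ID3v1 tags occupy the fixed final 128 bytes, so the metadata is recoverable from the two ends alone. A minimal sketch of reading an ID3v1 trailer (synthetic bytes for illustration, not Nutch's actual parse-mp3 code):

```python
def parse_id3v1(data: bytes):
    """Parse an ID3v1 tag from the last 128 bytes of an mp3, if present."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":          # ID3v1 magic marker
        return None
    return {
        # fixed-width fields, padded with NULs or spaces
        "title": tag[3:33].rstrip(b"\x00 ").decode("latin-1"),
        "artist": tag[33:63].rstrip(b"\x00 ").decode("latin-1"),
    }

# Build a synthetic 128-byte ID3v1 trailer for demonstration.
fake = b"TAG" + b"Halo".ljust(30, b"\x00") + b"Beyonce".ljust(30, b"\x00")
fake = fake.ljust(128, b"\x00")
info = parse_id3v1(b"\xffmp3-audio-bytes" + fake)
print(info)  # -> {'title': 'Halo', 'artist': 'Beyonce'}
```

A crawler taking this approach would issue HTTP Range requests (e.g. the first 1000 bytes and the last 128) instead of downloading the whole file.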
-Original
, 5 Mar 2009 1:24 pm
Subject: Re: what is needed to index for about 1 domains
Hi Alxsss,
How can we disable storing of contents of pages?
Regards,
Mayank.
On Wed, Mar 4, 2009 at 9:57 AM, alx...@aim.com wrote:
Hi,
I also noticed that we can disable storing content of pages
Hello,
I used lukeall-0.9.1.jar to manually add a new record to the index produced
by nutch-0.9. I added only the url and title fields, since I was not sure what
to put in the other fields. Now, for a search of any word I get this error:
HTTP Status 500 -
type Exception report
message
Hi,
I use nutch-0.9. I downloaded the lukeall-0.9.1.jar file from
http://www.getopt.org/luke/ and double-clicked it in windows. That website says:
"It uses the official Lucene 2.4.0 release JARs"
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To:
btw, which version of lucene is in nutch-0.9?
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Fri, 13 Mar 2009 5:14 pm
Subject: Re: error after adding indexes manually
What versions of Lucene are Nutch
I opened the lukeall-0.9.1.jar file, replaced org/apache/lucene with
org/apache/lucene from the lucene-core-2.1.0.jar file, and built a new
lukeall-0.9.2.jar. Now, when I double-click it, it says: Failed to load
Main-Class manifest attribute from lukeall-0.9.2.jar
Thanks.
Alex.
-Original
Comment out this line -[...@=] in crawl-urlfilter.txt.
Alex.
-Original Message-
From: MyD myd.ro...@googlemail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 19 Mar 2009 6:14 am
Subject: Re: Nutch doesn't find all urls.. Any suggestion?
I may have to say that in the
I think you must put
mycity.gov/water in your crawl-urlfilter.txt file.
Alex.
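To make that concrete, a hedged sketch of the relevant crawl-urlfilter.txt lines (hostname taken from the example above; the exact regex idiom may vary across Nutch versions):

```
# accept only urls under the water subweb
+^http://([a-z0-9]*\.)*mycity.gov/water
# skip everything else
-.
```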
-Original Message-
From: Robert Edmiston robert.edmis...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 26 Mar 2009 1:32 pm
Subject: Limiting crawls to subwebs
I am trying to
Hello,
I used lukeall-0.9.1 to manually add a document to the indexes generated by
nutch-1.0. However, the manually added documents do not show up in search.
Thanks for any suggestions.
A.
Thanks for your response. In
luke there is also an option to commit. I opened the new index again, and
the document I created is there. But the search does not return
anything for the added keywords. I will try Solr to see if it works.
alxsss is misleading - there is no commit() operation in Nutch.
Also, the index doesn't have to be optimized. The most likely reason why
the added document is not visible is that Nutch also needs a
corresponding record in the segments/... data. This is not possible to
create separately, you need
Aren't EC2 machines virtual hosts? I had a problem with speed with my virtual
hosts on a local linux box.
What is preferable, a dedicated server or EC2?
-Original Message-
From: Jack Yu jackyu...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 2 Apr 2009 6:54 pm
Subject: Re:
Hello,
I just heard that nutch-1.0 has Solr integration. Are there any tutorials on how
to add data to nutch-1.0 indexes manually using Solr?
Thanks.
Alex.
I went through that page. But when I try to add indexes manually using
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit waitFlush="false" waitSearcher="false"/>'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
The add request is like this:
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<add>
<doc boost="2.5">
<field name="segment">20090512170318</field>
<field name="digest">86937aaee8e748ac3007ed8b66477624</field>
<field name="boost">0.21189615</field>
<field name="url">test.com</field>
Hi,
Is it available on the internet? If not, could you please attach it?
Thanks.
A.
-Original Message-
From: Jake Jacobson jakecjacob...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Mon, Jul 13, 2009 1:26 pm
Subject: Nutch Tutorial 1.0 based off of the French Version
Hi,
As I understand it, only the indexing part of nutch exists in C++, as CLucene. I
want to code nutch in C++, but only if it is worth doing. I wondered if it is
worth coding the remaining parts of nutch in C++, say the crawler. Can
someone give me directions on where to start?
Thanks
Alex.
Hi,
I know nutch uses Lucene. But what is CLucene for, then? Only for indexing files
on a hard drive?
I have knowledge of C++ and some experience. I wanted to code the crawler of Nutch
in C++ to get more experience and make it open source, but only if it will be useful
for the open source
Hi,
The plugin is enabled in the nutch-default.xml file, but changes in it did not
affect search. Instead, changes in crawl-urlfilter.txt do change the fetched
links.
Thanks.
Alex.
-Original Message-
From: Paul Tomblin ptomb...@xcski.com
To: nutch-user@lucene.apache.org
Sent:
Thanks for your comments. Is there anything I could code in C++ that the open
source community would benefit from?
Alex.
-Original Message-
From: Otis Gospodnetic ogjunk-nu...@yahoo.com
To: nutch-user@lucene.apache.org
Sent: Tue, Aug 4, 2009 6:54 am
Subject: Re: Nutch in C++
Hello,
I am trying to paginate results obtained by using the opensearch rss. To do this
I need totalResults from the rss feed, which comes as
<opensearch:totalResults>100</opensearch:totalResults>
However, in PHP's simplexml_load_file results I do not see this part of the
feed. Does someone know how to get
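One common cause is that opensearch:totalResults is namespace-qualified, so it has to be looked up with the OpenSearch namespace URI (in PHP's SimpleXML that means using children() with the namespace). A small illustration in Python with a made-up feed fragment, assuming the standard OpenSearch 1.1 namespace:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal OpenSearch RSS fragment for illustration.
rss = """<rss xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <opensearch:totalResults>100</opensearch:totalResults>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# Namespaced children are only found when the lookup carries the namespace URI.
ns = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}
total = root.find("channel/opensearch:totalResults", ns)
print(total.text)  # -> 100
```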
Hi,
I have read a few tutorials on running Nutch to crawl the web. However, I still
do not understand the meaning of the topN variable in the crawl command. In the
tutorials it is suggested to create 3 segments and fetch them with topN=1000.
What if I create 100 segments, or only one? What would be
Thanks. What if the urls in my seed file do not have outlinks, say .pdf files?
Should I still specify the topN variable? All I need is to index all urls in my
seed file. And there are about 1 M of them.
Alex.
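For context, topN only caps how many of the top-scoring pending urls go into each generated segment; the whole seed list gets covered by repeating the generate/fetch/updatedb cycle. A hedged command sketch (directory names are assumptions, syntax as in nutch-0.9):

```
# one crawl pass: select at most 1000 top-scoring urls into a fresh segment
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/2* | tail -1`   # path of the segment just created
bin/nutch fetch $s                      # fetch its urls
bin/nutch updatedb crawl/crawldb $s     # fold results back into the crawldb
```

With a 1 M-url seed list and topN=1000, on the order of a thousand such passes would be needed just to cover the seeds, which may explain long runtimes.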
-Original Message-
From: Kirby Bohling kirby.bohl...@gmail.com
To:
In the tutorial on the wiki, the depth is not specified and topN=1000. I ran
those commands yesterday and it is still running. Will it index all my urls? My
seed file has about 20K urls.
Thanks.
Alex.
-Original Message-
From: Marko Bauhardt m...@101tec.com
To:
Hi,
After a merge of two segments failed with a no-space-available error, I deleted
all tmp folders. Now any attempt to use merge or crawl says
org.apache.hadoop.util.Shell$ExitCodeException: chmod:
/private/tmp/hadoop-root/mapred/system/job_local_0001: No such file or directory
Is there any
What does local mode mean? I was running the nutch merge command on my MacPro. I
created those folders and set permissions, then I restarted my laptop and it does
not start. When the merge command was working, I noticed that the free available
space was only 1kb. Does this mean that merge destroyed my laptop's
Hello,
I have run the merge script to merge two crawl dirs, one 1.6G and another 120MB.
But my MacPro with 50G of free space did not start after the merge crashed with a
no-space error. I have been told that OSX got corrupted.
I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty. Can
Hello,
I have a crawl folder with 2GB of data, and its index is 160MB. Then nutch
indexed another set of domains, and its crawl folder is about 1MB. I wondered if
there is an effective way of making the indexes from both folders available for
search without using the merge script, since merging large
Hello,
I have indexed a lot of mp3 files using nutch 1.0. Now, for a search from the
command line or tomcat, one keyword gives unrelated records.
For example, for the keyword beyonce, search gives all mp3s that have beyonce in
the id3 tags and a lot of unrelated files that absolutely do not have
Hello,
I am planning to index websites with German and Turkish symbols, like
latin letters with dots over them, etc. Which plugins should I activate?
Also, I wondered how to make nutch behave like the google or yahoo crawlers. I
see the google crawler in our apache logs every other minute. It follows