Help Setting Up Nutch 0.8 Distributed

2006-03-16 Thread Dennis Kubes
I am having trouble getting Nutch to work using the DFS. I pulled Nutch 0.8 from SVN and built it just fine using Eclipse. I was able to set it up on a Whitebox Enterprise Linux 3 Respin 2 box (800 MHz, 512 MB RAM) and do a crawl using the local file-system. I was able to setup the was inside of

RE: Help Setting Up Nutch 0.8 Distributed

2006-03-16 Thread Dennis Kubes
openssh-askpass-3.6.1p2-33.30.3 Is it something to do with my slaves file? -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, March 16, 2006 5:46 PM To: nutch-user@lucene.apache.org Subject: Re: Help Setting Up Nutch 0.8 Distributed Dennis Kubes wrote

RE: Help Setting Up Nutch 0.8 Distributed

2006-03-17 Thread Dennis Kubes
: Dennis -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, March 16, 2006 6:50 PM To: nutch-user@lucene.apache.org Subject: Re: Help Setting Up Nutch 0.8 Distributed Dennis Kubes wrote: : command not foundlaves.sh: line 29: : command not foundlaves.sh: line

RE: Help Setting Up Nutch 0.8 Distributed

2006-03-17 Thread Dennis Kubes
<value>/nutch/filesystem/data</value> </property> <property> <name>mapred.system.dir</name> <value>/nutch/filesystem/mapreduce/system</value> </property> <property> <name>mapred.local.dir</name> <value>/nutch/filesystem/mapreduce/local</value> </property> -Original Message- From: Dennis Kubes [mailto:[EMAIL

RE: Help Setting Up Nutch 0.8 Distributed

2006-03-17 Thread Dennis Kubes
found, forbidden1.size=1 forbidden2.size()=0 060317 102009 Zero targets found, forbidden1.size=1 forbidden2.size()=0 -Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Friday, March 17, 2006 9:55 AM To: nutch-user@lucene.apache.org Subject: RE: Help Setting Up Nutch 0.8

Delete Files from NDFS

2006-03-17 Thread Dennis Kubes
Is there a way to delete files from the DFS? I used the dfs -rm option, but the data blocks are still there. Dennis

Large Mapreduce Sizes and Long Index Times

2006-03-17 Thread Dennis Kubes
Finally got an index working with the Hadoop file system but just to do the apache.org site took around 2-3 hours and on each machine the mapreduce local data was around 4.5 Gigs. Anybody know what might be causing this? Dennis

RE: Large Mapreduce Sizes and Long Index Times

2006-03-18 Thread Dennis Kubes
. More with nutch though. Is there a way, you can document Lessons learned ? It can reduce quite a bit of heart breaks during various phases of crawling. I can help you document it if need be. Thanks On 3/17/06, Dennis Kubes [EMAIL PROTECTED] wrote: Finally got an index working with the Hadoop file

Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Dennis Kubes
All, I have finished a lengthy tutorial on how to set up a distributed implementation of nutch and hadoop. Should I post it on this list or is there a better place for it? Dennis

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Dennis Kubes
. Jake. -Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 1:01 PM To: nutch-user@lucene.apache.org Subject: Nutch and Hadoop Tutorial Finished All, I have finished a lengthy tutorial on how to setup a distributed implementation of nutch and hadoop

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Dennis Kubes
Here it is for the list, I will try to put it on the wiki as well. Dennis How to Setup Nutch and Hadoop After searching the web and mailing lists, it seems that there is very little information on how to setup

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Dennis Kubes
I will add in your changes and then put it up on the wiki. Dennis -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 2:41 PM To: nutch-user@lucene.apache.org Subject: Re: Nutch and Hadoop Tutorial Finished Dennis Kubes wrote: Here

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Dennis Kubes
- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 1:37 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch and Hadoop Tutorial Finished Not to act dumb, but how do I add it to the wiki? Dennis -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Dennis Kubes
work. Where exactly on the wiki is the tutorial? I'm not seeing it. Cheers, Chris On 3/20/06 2:52 PM, Dennis Kubes [EMAIL PROTECTED] wrote: The NutchHadoop tutorial is now up on the wiki. Dennis -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent

Meta-Refresh Question

2006-04-04 Thread Dennis Kubes
Silly question but nutch won't follow meta-refreshes will it? Dennis

RE: Meta-Refresh Question

2006-04-04 Thread Dennis Kubes
04, 2006 9:56 AM To: nutch-user@lucene.apache.org Subject: Re: Meta-Refresh Question Dennis Kubes wrote: Silly question but nutch won't follow meta-refreshes will it? It should have, parse-html has support for this (ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can see

RE: please help!! It always return 0 hit.

2006-04-07 Thread Dennis Kubes
Copying from Hadoop to local and then performing a search on the index is a question that needs to be posted to the list. My guess would be that you have an older version of the code and there were some bugs copying crc files. I think I remember something about that on the list a little while

Adding Level to Website Parse Data

2006-04-13 Thread Dennis Kubes
I am trying to modify Nutch to add level to the website parse data. What I mean by this is suppose you start parsing a website at its homepage that would be level one. Any links in the same site from the homepage would be level two, links from those pages would be level three and so on. I
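The level idea described above is essentially a breadth-first traversal of the link graph. A standalone sketch of the concept only (the class `LinkLevels` and its method are hypothetical illustrations, not Nutch code):

```java
import java.util.*;

public class LinkLevels {
    // Assign a "level" to each page reachable from the homepage:
    // the homepage is level 1, pages it links to are level 2, and so on.
    static Map<String, Integer> levels(Map<String, List<String>> outlinks, String home) {
        Map<String, Integer> level = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        level.put(home, 1);
        queue.add(home);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String target : outlinks.getOrDefault(page, List.of())) {
                if (!level.containsKey(target)) {      // first time seen: shortest path wins
                    level.put(target, level.get(page) + 1);
                    queue.add(target);
                }
            }
        }
        return level;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/", List.of("/about", "/products"),
            "/products", List.of("/products/widget"));
        System.out.println(levels(site, "/").get("/products/widget")); // prints 3
    }
}
```

In Nutch itself this value would have to be carried through the parse data rather than computed in memory, since the link graph lives on disk.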

Re: redirect treatment

2006-04-15 Thread Dennis Kubes
Protocol-level redirects (ASP redirects), meaning the server sends a 3xx redirect response code, work correctly in Nutch 0.8 dev. It processes it as a completely new page. If you are doing ASP forwards I believe that the original page (www.domain.com/?code.aspxredirect=445454) would be the

Re: redirect treatment

2006-04-15 Thread Dennis Kubes
being redirected and refetch at the new location. Am I correct? And if so, wouldn't nutch then index and display the new, redirected page? I'm using version .7 btw. thanks, Glenn Dennis Kubes wrote: Protocol level redirects (asp redirects), meaning the server sends a redirect response 3xx

Re: nutch readdb question

2006-04-24 Thread Dennis Kubes
I believe the retry numbers are the number of times page fetches failed with recoverable errors and were re-processed before the page was fetched. So most of the pages were fetched on the first try. Some encountered errors and were fetched on the next try, and so on. The default setting is a

How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
Can somebody direct me on how to get the stored text and parse metadata for a given url? Dennis

Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
? Dennis Kubes Doug Cutting wrote: NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided

Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
Truly I am just not understanding the concept of a segment. Dennis Kubes wrote: That got me started. I think that I am not fully understanding the role the segments directory and its contents play. It looks like it holds parse text and parse data in map files, but what is the content folder

File readers inside of Mappers and Reducers

2006-05-15 Thread Dennis Kubes
How would someone go about reading Map and Sequence file contents in Mappers and Reducers? Is it best to only use the addInputDirectory method and find a way to get all of the data one needs by key, or is there a good way to read file contents inside of map and reduce calls? I am not asking

Re: Nutch fetcher waiting inbetween fetch

2006-05-21 Thread Dennis Kubes
Is this possibly a DNS issue? We are running a 5M page crawl and are seeing very heavy DNS load. Just a thought. Dennis Stefan Neufeind wrote: Hi, I've encountered that here nutch is fetching quite a sum of URLs from a long list (about 25.000). But from time to time nutch is waiting for 10

Re: Nutch fetcher waiting inbetween fetch

2006-05-22 Thread Dennis Kubes
... but I don't know how/when the fetcher writes data to disk etc. Regards, Stefan Dennis Kubes wrote: Is this possibly a dns issue. We are running a 5M page crawl and are seeing very heavy DNS load. Just a thought. Dennis Stefan Neufeind wrote: Hi, I've encountered that here nutch

Re: Nutch fetcher waiting inbetween fetch

2006-05-22 Thread Dennis Kubes
What are the advantages of djbdns? We were looking at putting a caching nameserver on each fetcher node to reduce load on a single dns server. Andrzej Bialecki wrote: Dennis Kubes wrote: What we were seeing is the dns server cached the addresses in memory (bind 9x..) and because we were

Restarting Just Reduce Part of Fetch

2006-05-22 Thread Dennis Kubes
Is there a way to restart just the reduce part of the fetching process if it failed? The map jobs for 5M pages completed but the reduce jobs failed late. Is there a way to recover this so I don't have to recrawl all 5M pages? Dennis

Re: Restarting Just Reduce Part of Fetch

2006-05-22 Thread Dennis Kubes
I guess everything has a good side in that I will look into implementing this feature :). I think I am going to go back to smaller crawls and try merging incrementally. Andrzej Bialecki wrote: Dennis Kubes wrote: Is there a way to restart just the reduce part of the fetching process

Re: Run-Time Error

2006-05-26 Thread Dennis Kubes
On the launcher under classpath you will need to add the directory above plugins. Make sure this is on the Eclipse launcher though. Setting it on the project won't help. TDLN wrote: Did you add the plugins directory to your classpath and does it contain all of your plugins? Rgrds, Thomas On

Fetcher Stops Reports Pushes CPU to 100%

2006-06-06 Thread Dennis Kubes
Has anybody seen behavior where a fetcher during the reduce phase will stop reporting and push the CPU to 100% and stay that way until the task times out? I am seeing this on Fedora 5 minimal running Java 1.5_06 on dual-core processor machines with 2G of memory. I have tracked this down and I

Re: Fetcher Stops Reports Pushes CPU to 100%

2006-06-06 Thread Dennis Kubes
I have some special settings for larger fetches? Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: Has anybody seen behavior where a fetcher during the reduce phase will stop reporting and push the CPU to 100% and stay that way until the task times out. I am seeing this on Fedora 5 minimal

Re: Fetcher Stops Reports Pushes CPU to 100%

2006-06-06 Thread Dennis Kubes
). And this is happening on multiple machines so I do not think it is a machine problem. Again I need to spend some time looking through thread dumps. Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: Do you think it is the parsing that is causing it? Just checking ... probably not. You could figure out from

Re: Fetcher Stops Reports Pushes CPU to 100%

2006-06-09 Thread Dennis Kubes
. this: -.*(/.+?)/.*?\1/.*?\1/ changed to: -.*?(/.+?)/.*?\1/.*?\1/ I am currently testing this to see if it runs correctly without stalling as before. Problem is that I am not a regular expressions expert. Will changing this regex affect this expression in a negative way? Dennis Dennis Kubes wrote: I
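For readers wondering what this rule does: it flags URLs whose path repeats the same segment three times, a common symptom of crawler traps. A standalone sketch using plain java.util.regex (the class name and the trailing `.*` anchoring are my own additions; Nutch's regex URL filter applies the rule with find-style matching):

```java
import java.util.regex.Pattern;

public class RepeatedSegmentCheck {
    // Adaptation of the urlfilter rule -.*?(/.+?)/.*?\1/.*?\1/ :
    // match a URL whose path contains the same segment three times.
    static final Pattern REPEATED =
        Pattern.compile(".*?(/.+?)/.*?\\1/.*?\\1/.*");

    static boolean looksLikeTrap(String url) {
        return REPEATED.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTrap("http://example.com/a/b/a/c/a/d")); // true: /a/ repeats
        System.out.println(looksLikeTrap("http://example.com/a/b/c"));       // false
    }
}
```

The danger is not correctness but cost: on long URLs with no repeat, the nested lazy quantifiers force the engine to try many group boundaries before failing, which is the stalling observed in this thread.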

Re: Fetcher Stops Reports Pushes CPU to 100%

2006-06-10 Thread Dennis Kubes
don't think the other urls in the default regex-urlfilter file will cause any problems because they are not greedy, but I would suggest that we look at either changing this regular expression or removing it altogether from the default install. Dennis Andrzej Bialecki wrote: Dennis Kubes wrote

Large Scale Searching

2006-06-12 Thread Dennis Kubes
Is anyone doing large scale searching and if so what kind of architecture is good? I have a 25G index now (merged) and the searches are failing due to memory constraints. Is it better to have multiple smaller indexes across machines? If so, are those indexes stored on local machines or in

Setting up distributed searching..no results returned.

2006-06-16 Thread Dennis Kubes
Can someone explain how to set up distributed searching? I have a nutch-site.xml file set up like this: <configuration> <property> <name>fs.default.name</name> <value>local</value> </property> <property> <name>searcher.dir</name> <value>C:\SERVERS\tomcat\webapps\ROOT\WEB-INF\classes</value> </property>
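Laid out as a well-formed file, the nutch-site.xml sketched above would read as follows (paths are the poster's; this reconstructs only the quoted snippet, not the complete file):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:\SERVERS\tomcat\webapps\ROOT\WEB-INF\classes</value>
  </property>
</configuration>
```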

Re: No FS indicated, using default:local

2006-06-24 Thread Dennis Kubes
is there a folder (not file) called db in the directory from which you are starting the script? Dennis Benedikt Schackenberg wrote: hello, after i run bin/nutch generate db segments i get 060624 145542 parsing file:/home/nutch/nutch-0.7.1/conf/nutch-default.xml 060624 145542 parsing

Re: Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Dennis Kubes
We have seen this before too. If it is the same problem, it is the regex URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the regex-urlfilter.txt file and it should resolve itself. Also search the forum for Fetcher stops pushes cpu to 100%. Dennis Daniel Varela Santoalla
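Concretely, the change in conf/regex-urlfilter.txt is a one-line comment-out (a leading `#` makes the filter skip the rule):

```
# disabled: this rule can stall the fetcher with heavy regex backtracking
# -.*(/.+?)/.*?\1/.*?\1/
```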

Re: Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Dennis Kubes
this helps. Dennis Daniel Varela Santoalla wrote: Hello Dennis et al Dennis Kubes wrote: We have seen this before too. If it is the same problem, it is the regex URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the regex-urlfilter.txt file and it should resolve itself. I'm afraid

Input and Output Value Class Types

2006-06-29 Thread Dennis Kubes
All, Is there a way to get around having to have the input value class and output value class be the same? I have an ObjectWritable that I am trying to unwrap. Dennis

Re: Input and Output Value Class Types

2006-06-29 Thread Dennis Kubes
be this can help you. Also please browse the hadoop developer list archive since there was some related discussion. HTH Stefan Am 29.06.2006 um 14:41 schrieb Dennis Kubes: All, Is there a way to get around having to have the input value class and output value class be the same? I have an object

Getting Keywords from Metatags

2006-07-26 Thread Dennis Kubes
Does anyone know how to get the keywords from the meta tags of a page? I have been looking around but it wasn't immediately apparent how to do this. Dennis

Fetch jumps to 1.0 complete

2006-08-04 Thread Dennis Kubes
Hi Everybody, I am running some fetches and some of the tasks are going along fine and are about 20% complete and then they will mysteriously jump to 100% complete. Each time I get the Aborting with N hung threads in the logs. Is anybody else seeing this? Is there any way to get around the

Re: Fetch jumps to 1.0 complete

2006-08-04 Thread Dennis Kubes
the crawl-url filter regular expressions for [EMAIL PROTECTED] and -.*(/.+?)/.*?\1/.*?\1/. Andrzej , didn't you say awhile back when we were looking at regular expressions for a different stalling problem that you don't use these in your production systems? Dennis Andrzej Bialecki wrote: Dennis

Re: Fetch jumps to 1.0 complete

2006-08-04 Thread Dennis Kubes
it was failing. As soon as I get this running automatically in production I am going to try and implement the 339 patch. Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: I moved off of the most recent dev branches for our production system and put them on the release version for 0.8. I only

Re: Fetch jumps to 1.0 complete

2006-08-05 Thread Dennis Kubes
Bialecki wrote: Dennis Kubes wrote: Just a thought going through the fetcher code. If the robots.txt specifies a delay = the task timeout value, the task thread will sleep for that amount of time and eventually be considered a hung thread even though it is really just sleeping. Of course I

Distributed Searching Index Size

2006-08-07 Thread Dennis Kubes
Does anybody have recommendations as to index size per machine using distributed searching? We are currently doing about 1M pages per index with each index on a separate machine. That seems to work fast. Just wanted to know what stats other people were using. Also I thought I remembered

Re: Aborting with hung threads

2006-08-09 Thread Dennis Kubes
I have come up with a temporary hack for this. This is caused by crawl delays in the robots.txt file being set to huge amounts (for example I saw many that were set to 500 seconds and some as high as 72000 seconds). Attached is a temporary patch. It will get it working but is definitely not

Re: problems with the dfs command

2006-08-09 Thread Dennis Kubes
Please give more information about the specific errors you are seeing. Dennis kawther khazri wrote: Hello, I'm following the guide http://wiki.apache.org/nutch/NutchHadoopTutorial?highlight=%28nutch%29 to install Nutch. I have a problem with the command bin/hadoop dfs -ls (a lot of

Re: Fetch jumps to 1.0 complete

2006-08-09 Thread Dennis Kubes
a ProtocolStatus with, say, GONE or something like that? Dennis Dennis Kubes wrote: I added some test code that hacks a 30 second delay when the delay is greater than 30 seconds. It prints out the original delay value. Here is the output I am seeing: task_0005_m_05_0 Someone is setting way too long

Re: number of mapper

2006-08-10 Thread Dennis Kubes
There is also a mapred.tasktracker.tasks.maximum variable which may be causing the task number to be different. Dennis Murat Ali Bayir wrote: Hi everybody, Although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as a number

Re: problems with start-all command

2006-08-10 Thread Dennis Kubes
The namenode is running. Run the bin/stop-all.sh script first and then do a ps -ef | grep NameNode to see if the process is still running. If it is, it may need to be killed by hand: kill -9 processid. The second problem is the setup of ssh keys as described in a previous email. Also I would

Re: crawl-urlfilter subpages of domains

2006-08-12 Thread Dennis Kubes
You can use a suffix filter if there are no query strings. Dennis Jens Martin Schubert wrote: Hello, is it possible to crawl e.g. http://www.domain.com, but to skip crawling all urls matching to (http://www.domain.com/subpage/) I tried to achieve this with

Re: On fetcher slowness

2006-08-13 Thread Dennis Kubes
. But the problem that I run into are the fetcher threads hangs, and for crawl delay/robots.txt file (Please see Dennis Kubes posting on this). Yes, these are definitely problems. Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising, but not yet ready for prime time. -- Ken

Re: crawl w/o store

2006-08-13 Thread Dennis Kubes
You can add the property to the nutch-site.xml file to take precedence over the default in the nutch-default.xml file. The value is as below. This is for Nutch 0.8; I am not sure if it is the same for 0.7.2. <property> <name>fetcher.store.content</name> <value>false</value> <description>If true, fetcher will
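As a well-formed fragment for nutch-site.xml (the description element is truncated in the message above and omitted here):

```xml
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```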

Re: what Linux distribution goes best with Nutch?

2006-08-17 Thread Dennis Kubes
We are running our clusters on FC5 with absolutely no problems. I actually had more problems on Whitebox Linux (Red Hat Enterprise 4). What kind of problems are you encountering? Dennis Florian Fricker wrote: I run Nutch on Gentoo, works with no problems. Regards Ken Gregoire wrote: Yup,

Re: what Linux distribution goes best with Nutch?

2006-08-17 Thread Dennis Kubes
I installed nutch, tomcat, and java fresh. All of my FC5 installs use only the minimal amount of packages, I think just editors, admin tools and base. I don't put x servers on them. We also use network boots and kickstart load to get a consistent install across machines. We install java,

Re: problem in crawling......

2006-08-22 Thread Dennis Kubes
Unfortunately you have to start over. We started breaking our crawls into 100K to 500K runs because of this. Dennis Abdelhakim Diab wrote: Hi all: What can I do if I were crawling a big list of sites and suddenly the crawler stopped for any problem? Must I re-run the whole process or I can

Re: log4j:WARN Please initialize the log4j system properly.

2006-08-22 Thread Dennis Kubes
it's looking for the ${hadoop.log.dir} variable, so you can either set that on the command line or you can change your log4j.properties file in conf like I did: log4j.rootLogger=INFO,console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err
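Split onto separate lines, the log4j.properties change above reads as follows; the first three lines are from the message, while the layout lines are a common ConsoleAppender addition and are my assumption:

```properties
log4j.rootLogger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
# assumed: a ConsoleAppender also needs a layout to be configured
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c - %m%n
```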

Re: how to set NUTCH_JAVA_HOME

2006-08-29 Thread Dennis Kubes
in windows under control panel - system - advanced - environment variables - system variables. Dennis Kubes Philip Brown wrote: nutnoob wrote: how to set NUTCH_JAVA_HOME ??? I have java install in machine but don't know how to set it for nutch. please help me . see link: Setting

Re: how to combine two run's result for search

2006-09-04 Thread Dennis Kubes
You can keep the indexes separate and use the distributed search server, one per index, or you can use the mergedb and mergesegs commands to merge the two runs into a single crawldb and a single segment, then re-run the invertlinks and index to create a single index file which can then be

Re: Customize the crawl process

2006-09-08 Thread Dennis Kubes
You would need to modify Fetcher line 433 to use a text output format like this: job.setOutputFormat(TextOutputFormat.class); and you would need to modify Fetcher line 307 to only collect the information you are looking for, maybe something like this: Outlink[] links =

Re: # of tasks executed in parallel

2006-09-08 Thread Dennis Kubes
How many urls are you fetching and does each machine have the same settings as below? Remember that the number of fetchers is the number of fetcher threads per task per machine. So you would be running 2 tasks per machine * 12 threads * 3 machines = 72 fetchers. Dennis Vishal Shah wrote: Hi,

Re: Reduce Error during fetch

2006-09-08 Thread Dennis Kubes
You may be running into problems with regex stalls on filtering. Try removing the regex filter from the nutch-site.xml plugin.includes property. I was having similar problems before switching to just use prefix and suffix filters as below. I attached my prefix and suffix url filter files

Re: two nutch indexes on same webserver

2006-09-08 Thread Dennis Kubes
Assuming you have two separate war files deployed, it should be as easy as setting the searcher.dir property in the nutch-site.xml file in the different web-inf directories to the separate index locations. If you want to go the distributed searching route there is a in depth explanation on

Re: java.lang.OutOfMemoryError: Java heap space

2006-09-10 Thread Dennis Kubes
I don't know if it is the same in 0.7.2, but in 0.8 there is a hadoop-env.sh file where you can uncomment the JAVA_OPTS variable and give the heap more memory. Either way the JVM must be started with more memory, something like the VM option -Xmx1024M for a 1 GB heap. Dennis Bogdan Kecman
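For reference, the heap can be raised via the JVM flag mentioned above or, in 0.8-era installs, in conf/hadoop-env.sh; the exact variable name varies between versions, so treat this as a sketch:

```
# conf/hadoop-env.sh -- heap size for Hadoop daemons, in MB (name varies by version)
export HADOOP_HEAPSIZE=1024
```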

crawl_generate

2006-09-11 Thread Dennis Kubes
Besides the initial fetch is the crawl_generate folder in a segment used anywhere else? Would it be safe to delete or not have the crawl_generate folder while searching? Dennis

Nutch and DFS for Loading

2006-09-11 Thread Dennis Kubes
Does anybody have a setup where nutch is loaded from an NFS mount? Not Hadoop, just using NFS to start nutch from a centralized location? Dennis

Re: Question about using Nutch plug-ins as libraries

2006-09-13 Thread Dennis Kubes
Is the plugins folder in the root of the war? Dennis Trym B. Asserson wrote: Hello, We're currently developing an application using the Lucene API for building a search engine and as part of the application we have a component for parsing several file formats. For this component we were

Re: Nutch Cannot Find Indexed Pages?

2006-09-14 Thread Dennis Kubes
Does it not have anything in the database or are there entries in the index but nothing is being returned by the search? Dennis victor_emailbox wrote: Can anyone help? Thanks. victor_emailbox wrote: Hi, I followed all the steps in the 0.8 tutorial except that I have only 2 urls in the

Re: How to build nutch with ant?

2006-09-17 Thread Dennis Kubes
Run ant package; the full distribution is under the build/nutch-x.x folder. heack wrote: I run ant in the nutch base dir, and it compiles successfully. But it does not generate nutch-0.8.jar or nutch-0.8.war, only a nutch-0.8.job file (and other plugin classes) in the build folder. What options should I use

Re: (Problem)Why after I ran ant in nutch base dir, NO nutch-0.8.jar found in build dir?

2006-09-17 Thread Dennis Kubes
because the default target is job, which creates the job file; run package to create all. heack wrote: Only a nutch-0.8.job file there. And also a question: what should the next step be after I modify source code like NutchAnalysis.jj and use ant to build it? The search.jsp seems not to use the

Re: how to turn on fetcher log?

2006-09-21 Thread Dennis Kubes
It depends on settings in the conf/log4j.properties file for the level of logging. The log files are in the HADOOP_LOG_DIR directory, which can be set in the hadoop-env.sh file in the conf directory. Usually the file is called hadoop-phoenix-tasktracker... Dennis Mike Smith wrote: Hi, I

Re: no results in nutch 0.8.1

2006-09-27 Thread Dennis Kubes
Did you setup the user agent name in the nutch-site.xml file or nutch-default.xml file? Dennis carmmello wrote: I have followed the steps in the 0.8.1 tutorial and, also, I have been using Nutch for some time now, without seeing the kind of problem I am encountering now. After I have

Re: no results in nutch 0.8.1

2006-09-28 Thread Dennis Kubes
, I can not find any error regarding this - Original Message - From: Dennis Kubes [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, September 27, 2006 7:59 PM Subject: Re: no results in nutch 0.8.1 Did you setup the user agent name in the nutch-site.xml file or nutch

Re: crawl db disrtibution on different data nodes

2006-10-10 Thread Dennis Kubes
It completely depends on the number of urls in the crawldb. Dennis jaison Qburst wrote: What will be the maximum size of crawlDb on a single node?

Re: Searching terms saved in a file

2006-10-10 Thread Dennis Kubes
You would have to write something that would loop through the file and then construct a Query object using the addRequired and addProhibited methods to add your terms and phrases. Then pass that into the appropriate NutchBean search method to get your results. Dennis frgrfg gfsdgffsd wrote:

Re: Database update

2006-10-10 Thread Dennis Kubes
You could write a MapReduce job that would use the parse_data folder as input and inside the map or reduce class depending on your logic use jdbc to update to mysql. It would look something like this for the job configuration. JobConf yourjob= new NutchJob(conf); for (int i = 0; i

Re: java.lang.NoSuchMethodError while indexing

2006-10-10 Thread Dennis Kubes
What Java version are you using? Might it be needing Java 5? Dennis Adam Borkowski wrote: Question from the newbie. I've just downloaded version 0.8.1 and am going through the tutorial. Almost got to the end, but after the index command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb

Re: Extending BasicQueryFilter for a new plugin?

2006-10-17 Thread Dennis Kubes
I don't know exactly what you are wanting to do below. Adding a term through a query filter would be something like this: import org.apache.nutch.searcher.FieldQueryFilter; import org.apache.hadoop.conf.Configuration; public class NewQueryFilter extends FieldQueryFilter { public

Re: java 1.5 or 1.4

2006-10-17 Thread Dennis Kubes
A guess would be that somewhere in your classpath you have the wrong version of xalan. Dennis NG-Marketing, M.Schneider wrote: Hello list, when I use Java 1.4 everything works well, but if I switch to 1.5 i have the following error:

Re: fetch fails at reduce stage because can not sense heartbeat for 600 seconds

2006-10-17 Thread Dennis Kubes
I have seen this happen before if the box is loaded down with too many tasks and the IO is maxed. I have also seen this happen when the regex filters spin out. We changed our systems to use only prefix and suffix url filters and that cleared up those types of problems for us. Dennis Mike

Re: fetch fails at reduce stage because can not sense heartbeat for 600 seconds

2006-10-18 Thread Dennis Kubes
I agree with Andrzej that a thread dump would be best. Also what version of nutch are you using? Dennis Andrzej Bialecki wrote: Mike Smith wrote: Hi Dennis, But it doesn't make sense since the reducers' keys are URLs and the heartbeat cannot be sent when the reduce task is called. Since

Re: fetch fails at reduce stage because can not sense heartbeat for 600 seconds

2006-10-20 Thread Dennis Kubes
This was where ours was freezing as well. Don't know why the regex causes it other than it is greedy. I will try and take a look at the JS parser when I get some time. Dennis Mike Smith wrote: Thank you guys for the hints and helps. I could manage to find the root of the problem. When the

Re: Plugin HitCollector

2006-10-23 Thread Dennis Kubes
We are running into the same issue. Remember that hits just give you a doc id, and getting hit details from a hit does another read. So looping through the hits to access every document will do a read per document. If it is a small number of hits, no big deal, but the more hits to access,

Re: generate db segments topN with TYPE

2006-10-23 Thread Dennis Kubes
You could use suffix filters to filter out any document that isn't a PDF. Dennis Marco Vanossi wrote: Hi, Do you think there is an easy way to make nutch generate a list of only certain document types to fetch? For example: If one would like to crawl only PDF docs (after some pages was

Re: Plugin HitCollector

2006-10-24 Thread Dennis Kubes
Our problem is that we need to count hits for sub-categories. There are over 550,000 categories. I am assuming I can't do this inside of a bitset? Is there a good way to do this type of functionality? Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: We are running into the same issue

Re: Re : Urgent : Fetcher aborts with hung threads

2006-11-03 Thread Dennis Kubes
The reason no one answered is because it has been answered before a couple of times. If you do a search on this mailing list for fetcher slowness or fetcher hung threads you will get answers. You can also take a look at NUTCH-344. This problem has come up before and there are patches which

Re: query to hit all

2006-11-08 Thread Dennis Kubes
Segment is indexed as a field, so you could write a query filter that includes the segment name. You could also use an IndexReader and loop through document by document from 0 to maxDoc() - 1, checking for the segment field. The second option is much more resource intensive though. Dennis

Re: classifying content

2006-12-06 Thread Dennis Kubes
You may also want to look at Bayesian statistics, support vector machines, and machine learning algorithms. Dennis kauu wrote: this is exactly also what I wonder On 12/5/06, chad savage [EMAIL PROTECTED] wrote: Hello All, I'm doing some research on how to classify documents into

Re: Which Operating-System do you use for Nutch

2006-12-21 Thread Dennis Kubes
Fedora Core 5 minimal install with Java 1.5.10 Tomi NA wrote: On 9/26/06, Jim Wilson [EMAIL PROTECTED] wrote: I'd do it, but I'm too busy being consumed with worries about the lack of support for HTTP/NTLM credentials and SMB fileshare indexing. Arrrgg - tis another sad day in the life of

Re: Cannot generate all injected URLS

2006-12-21 Thread Dennis Kubes
What was the problem? Dennis Frank Kempf wrote: solved THX

Re: dump page content to Windows file system?

2006-12-21 Thread Dennis Kubes
If you mean from the DFS to local filesystem you can do a copyToLocal. If you mean from a binary to a readable format your would need to write a MapReduce job and specify a TextOutputFormat. If you are trying to read the crawl database you can use the nutch readdb command. Dennis David

Re: Need help with deleteduplicates

2006-12-29 Thread Dennis Kubes
(that.hash)) { // order first by hash return this.hash.compareTo(that.hash); ... So, is that where I would place my similarity score and return that value there? Dennis Kubes wrote: If I am understanding what you are asking, in the getRecordReader method of the InputFormat inner

Re: NUTCH 0.8.1: Difficulties with Analyzers

2007-01-01 Thread Dennis Kubes
I have not used the French analyzer...but did you use the French analyzer for both indexing and searching? Dennis [EMAIL PROTECTED] wrote: I am having a hard time implementing the French Analyzer... Any help will be immensely appreciated. Here are the details; first I tried with the official

Re: Duplicate URLs with slightly different URIs.. how to normalize?

2007-01-03 Thread Dennis Kubes
-parse, re-index and then dedup. Another option is a url filter that simply removes urls with the #a as they are internal links. Again you would need to re-parse, etc. Let me know if you need more information on how to do this. Dennis Kubes Brian Whitman wrote: I'm using Solr to search

Re: re-parse hang?

2007-01-04 Thread Dennis Kubes
What nutch version are you using and what is your setup? An 80K reparse should only take a few minutes at most. Dennis Brian Whitman wrote: On yesterday's nutch-nightly, from Dennis Kubes' suggestions on how to normalize URLs, I removed the parsed folders via rm -rf crawl_parse parse_data

Re: Issues Starting Hadoop Process in Nutch 0.9.1

2007-01-04 Thread Dennis Kubes
I would take a look at the processes on the namenode server and see if the namenode has started up. It doesn't look like it did. If this is a new install, did you format the namenode? Dennis srinath wrote: Hi, While starting hadoop process we are getting the following error in logs