No Search Result if add boosting factor in search field!

2008-08-01 Thread dealmaker
Hi, I am using Nutch 0.8.1, and it returns 2 search results if I enter "search" in the search field. If I enter "search^3" in the search field, it returns 0 results. Why? I tried other boosting factors and search words as well, with the same outcome: no results. But it returns results if I don't include the

Re: No Search Result if add boosting factor in search field!

2008-08-12 Thread dealmaker
OK, I realized that Nutch disables search term boosting in queries. You can do that only with the Lucene API. Is there any way to get around that? Thanks. dealmaker wrote: Hi, I am using Nutch 0.8.1, and it returns 2 search results if I enter "search" in the search field. If I enter "search^3"

Nutch keeps stripping my Url parameters, how do I stop that?

2008-08-12 Thread dealmaker
Hi, I am entering a query for a URL like this: url:http://www.mysite.com/index.php?option=abc&another=param, but I found that it always strips the URL down to http://www.mysite.com/index.php?option=abc, so the parameters after that are missing. How do I change it? Thanks.
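One possible cause, offered only as an assumption since the thread does not confirm it, is that the `&` separating the URL's parameters is read as a separator of the search page's own request parameters, so everything after it never reaches the query. A minimal sketch of percent-encoding the whole query string before placing it in a request (the class and method names here are hypothetical and are not part of Nutch):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodeDemo {
    // Percent-encode a complete query string so that characters such as
    // '&', '?' and '=' survive as data inside a single request parameter.
    static String encodeQuery(String rawQuery) {
        try {
            return URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "url:http://www.mysite.com/index.php?option=abc&another=param";
        // The '&' becomes %26, so it is no longer taken as a parameter separator.
        System.out.println(encodeQuery(raw));
    }
}
```

Whether this resolves the behavior described depends on where the truncation actually happens, which the message does not say.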

Most Common Anchor Text list?

2008-08-19 Thread dealmaker
Hi, I am looking for a list of the most common anchor text (words and phrases). Something like a list of stop words, but specifically for anchor text, e.g. "click here" is one of them. Does anyone know where to find one? Thank you.

Re: Most Common Anchor Text list?

2008-08-20 Thread dealmaker
or whatever. Brian Ulicny On Tue, 19 Aug 2008 22:47:19 -0700 (PDT), dealmaker [EMAIL PROTECTED] said: Hi, I am looking for a list of the most common anchor text (words and phrases). Something like a list of stop words, but specifically for anchor text, e.g. "click here"

Can Nutch Search Subset of Websites?

2009-01-23 Thread dealmaker
Hi, OK, I crawled and indexed 1000 websites, and I am trying to return search results from only 5 of them. E.g. there may be 100 websites in the search results, but I am interested in only 5 specific websites (site1.com, site2.com ... site5.com only), so I am more interested in the rank of these 5
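As an illustration of the idea only, and not of any Nutch API, the restriction can be sketched as a post-filter over result URLs that are already in rank order. The class name, method name and the choice to strip a leading "www." are assumptions made for this sketch:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SiteFilterDemo {
    // Keep only the URLs whose host is in allowedHosts, preserving rank order.
    static List<String> keepSites(List<String> rankedUrls, Set<String> allowedHosts) {
        List<String> kept = new ArrayList<String>();
        for (String url : rankedUrls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no host component, cannot match any site
            }
            host = host.toLowerCase(Locale.ROOT);
            if (host.startsWith("www.")) {
                host = host.substring(4); // treat www.site1.com like site1.com
            }
            if (allowedHosts.contains(host)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<String>(Arrays.asList("site1.com", "site5.com"));
        List<String> ranked = Arrays.asList(
                "http://www.other.com/a.html",
                "http://www.site5.com/b.html",
                "http://site1.com/c.html");
        System.out.println(keepSites(ranked, allowed));
    }
}
```

Filtering after the search keeps the relative order of the chosen sites, which matches the interest in their rank, but it has to scan every hit; restricting the query itself would scale better if the search front end supports it.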

How do you setup your svn for your nutch code?

2009-03-01 Thread dealmaker
Hi, I am modifying the Nutch 0.9 code for my project. Currently, I keep all my 0.9 code in my local main trunk. But I know that 1.0 will be out soon, and I want to use the 1.0 code instead in the near future. What is the best way to set up svn to do that? Should I just sync the main trunk from the Apache server

Re: How do you setup your svn for your nutch code?

2009-03-01 Thread dealmaker
, dealmaker vin...@gmail.com wrote: Hi, I am modifying the Nutch 0.9 code for my project. Currently, I keep all my 0.9 code in my local main trunk. But I know that 1.0 will be out soon, and I want to use the 1.0 code instead in the near future. What is the best way to set up svn to do that? Should I

Re: How do you setup your svn for your nutch code?

2009-03-01 Thread dealmaker
merge with the git version of Nutch. On Mon, Mar 2, 2009 at 9:27 AM, dealmaker vin...@gmail.com wrote: No, it's not the official 1.0. Even so, there may be a 1.1 in the future. I just want to know how to set up svn for future versions in a way that needs minimal maintenance. Thanks. Tony Wang-3 wrote

Re: How do you setup your svn for your nutch code?

2009-03-01 Thread dealmaker
And also, do you clone the main trunk or just a release, for example 0.9? Dingding Ye wrote: I have used git-svn to clone the Nutch project, and then I use a git repo to manage my personal version and do a periodic merge with the git version of Nutch. On Mon, Mar 2, 2009 at 9:27 AM, dealmaker vin

Re: How do you setup your svn for your nutch code?

2009-03-01 Thread dealmaker
svn. It helps the merge go smoothly. What I did before was to clone the main trunk. It should work for 0.9 as well. However, if you make rapid changes to the sources, I think neither approach helps and you have to resolve the conflicts yourself. On Mon, Mar 2, 2009 at 11:55 AM, dealmaker vin...@gmail.com

getIndexDocNo ( ) doesn't exist in Nutch nightly build anymore?

2009-03-02 Thread dealmaker
I was modifying Nutch 0.9, and the following code worked fine in the NutchBean class:
  MoreLikeThis mlt = new MoreLikeThis(((IndexSearcher) searcher).getIndexReader());
  org.apache.lucene.search.Query q = mlt.like(myHits.getHit(0).getIndexDocNo());
After I upgraded to the nightly build today, I

Does MoreLikeThis work with Nutch 1.0 / nightly build?

2009-03-02 Thread dealmaker
I just downloaded the nightly build, and it seems that Nutch no longer uses a document number and instead uses a key to locate a document. Because of that, my code with MoreLikeThis doesn't work anymore. How do I make the new Nutch code work with MoreLikeThis? The following was my old code:

Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.

2009-03-04 Thread dealmaker
I am using the Nutch nightly build #741 (Mar 3, 2009 4:01:53 AM). I am at the final phase of crawling, following the tutorial on the Nutch.org website. I ran the following command, and I got an exception in Hadoop. I double-checked the folder paths in nutch-site.xml, and they are correct. I tried

Re: Exception when crawling

2009-03-04 Thread dealmaker
I have a similar problem with nightly build #741 (Mar 3, 2009 4:01:53 AM). What's wrong? Log from Hadoop: 2009-03-04 14:30:31,531 WARN mapred.LocalJobRunner - job_local_0001 java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored at

Re: Problem with crawling using the latest 1.0 trunk

2009-03-04 Thread dealmaker
Glad I found this thread; I got the same problem in build #741. Could someone commit a fix to the trunk so we can have a working nightly build? Thanks. tigertail wrote: Andrzej, I checked the latest SVN version and I faced the same problem. As it has been a long time I do not

Where can I download old carrot2 2.1 code binary?

2009-03-05 Thread dealmaker
Hi, my system uses a carrot2 2.1 plugin, so I need the source code and binary of carrot2 2.1 to examine the source and do further tweaking. Do you know where I can download version 2.1? I don't have time to upgrade to 3.0 because the API changed in 3.0.

How to ignore search results that don't have related keywords in main body?

2009-03-22 Thread dealmaker
Most webpages have sections like navigation, header, left column for related links, footer, etc. How can I prevent Nutch from returning search results that contain keywords only in the non-main body of the page? e.g. keywords can appear in navigation bar or footer, but they may not appear in

Template Detection?

2009-03-23 Thread dealmaker
Hi, does Nutch or any plugin have template detection? It seems that navigation and footer sections usually distort the ranking of search results. Is there an existing open source project or code that I can integrate into Nutch to give it template detection? Thanks.

Re: Template Detection?

2009-03-23 Thread dealmaker
Is there any substitute for template detection? Any easy hack, ready-made plugin or open source project that can improve the search results to some degree without template detection? Thanks. Andrzej Bialecki wrote: dealmaker wrote: Hi, does Nutch or any plugin have

How to save additional data into crawl db or segment?

2009-03-24 Thread dealmaker
Hi, at crawling/indexing time, I want to do some additional processing on the raw HTML Nutch just crawled, and I want to save additional custom data based on that raw HTML for later retrieval. Should I save this additional custom data to the crawlDB, the segment, or somewhere else? I need to

How to find out the encoding and format of the content stored in the index?

2009-04-04 Thread dealmaker
Hi, I am trying to find out the encoding and format of the content stored in the index. I modified the code in BasicIndexFilter.java to store the content. But I need to know the encoding of the stored content, and this information doesn't seem to be stored. I also need to know whether it's

Re: Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.

2009-04-09 Thread dealmaker
it? dealmaker wrote: I am using the Nutch nightly build #741 (Mar 3, 2009 4:01:53 AM). I am at the final phase of crawling, following the tutorial on the Nutch.org website. I ran the following command, and I got an exception in Hadoop. I double-checked the folder paths in nutch-site.xml

How come getContent returns HTML Entities?

2009-04-11 Thread dealmaker
I modified the index code to make it store the HTML after fetching. But when I use getContent() to get the HTML, it returns HTML entities like &lt; instead of <. Is there any purpose in storing the HTML as HTML entities? Can I just store it as regular HTML, not HTML entities? Thanks.
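The message does not say where the escaping happens, so the following is only a sketch of reversing it after retrieval, not an explanation of Nutch's behavior. It assumes that only the basic entities (`&lt;`, `&gt;`, `&quot;`, `&#39;`, `&amp;`) are involved; the class and method names are hypothetical:

```java
public class EntityDemo {
    // Decode the basic HTML entities. "&amp;" is replaced last so that a
    // doubly escaped sequence such as "&amp;lt;" decodes to "&lt;" and not "<".
    static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescapeBasic("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"));
    }
}
```

Content with named entities beyond these five, or with numeric character references, would need a fuller decoder than this one.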

How does Nutch Fetch Files in Relative Path?

2009-04-14 Thread dealmaker
Hi, I am writing code to make Nutch fetch files referenced by relative paths in an HTML page. The URL of the webpage can have the format http://www.mysite.com/folder1/page.html or http://www.mysite.com/folder1. The path of the file can be ../../image.jpg, or http://www.example.com/image.jpg, or
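Independently of how Nutch itself handles outlinks, resolving a link against the URL of the page it appears on can be sketched with the standard `java.net.URI` class. The class and method names below are hypothetical, and the example links are chosen for illustration:

```java
import java.net.URI;

public class RelativeUrlDemo {
    // Resolve a link found in a page against the URL of that page.
    // An absolute link is returned unchanged.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.mysite.com/folder1/page.html";
        // One level up from /folder1/ is the site root.
        System.out.println(resolve(page, "../image.jpg"));
        // Already absolute, so the base URL is ignored.
        System.out.println(resolve(page, "http://www.example.com/image.jpg"));
        // Without a trailing slash, "folder1" counts as a file name and is dropped.
        System.out.println(resolve("http://www.mysite.com/folder1", "image.jpg"));
        // With a trailing slash, "folder1/" counts as a directory and is kept.
        System.out.println(resolve("http://www.mysite.com/folder1/", "image.jpg"));
    }
}
```

The last two calls show why the two page URL formats in the question behave differently: the trailing slash decides whether folder1 is part of the base directory.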

How to get Bean without Servlet?

2009-05-13 Thread dealmaker
Hi, I am using the bean in a class without a servlet. How do I retrieve the bean with the conf without using a servlet? I see the following code in OpenSearchServlet.java. I need the conf to get the bean, but both NutchBean.get() and NutchConfiguration.get() need a servlet. How do I get around that? Thanks.