Hi,
I am using Nutch 0.8.1, and it returns 2 search results if I enter
search in the search field. If I enter search^3 in the search field, it
returns 0 results. Why? I tried other boosting factors and search words as
well, with the same outcome: no results. But it returns results if I don't include the
OK, I realized that Nutch disables search-term boosting in queries. You can
only do that with the Lucene API.
Is there any way to work around that?
Thanks.
dealmaker wrote:
Hi,
I am using Nutch 0.8.1, and it returns 2 search results if I enter
search in the search field. If I enter search^3
Hi,
I am entering a query for a URL like this:
url:http://www.mysite.com/index.php?option=abc&another=param, but I found
that it always strips the URL down to
http://www.mysite.com/index.php?option=abc, dropping the parameters
after that. How do I change this?
Thanks.
Hi,
I am looking for a list of the most common anchor text (words and phrases).
Something like a list of stop words, but I am looking for one that is
specifically for anchor text, e.g. "click here" is one of them. Does
anyone know where to find one?
Thank you.
or whatever.
Brian Ulicny
On Tue, 19 Aug 2008 22:47:19 -0700 (PDT), dealmaker [EMAIL PROTECTED]
said:
Hi,
I am looking for a list of the most common anchor text (words and phrases).
Something like a list of stop words, but I am looking for one that is
specifically for anchor text. e.g. click here
Hi,
OK, I crawled and indexed 1000 websites, and I am trying to return search
results for only 5 of them. E.g. there may be 100 websites in the search
results, but I am interested in only 5 specific websites (site1.com,
site2.com ... site5.com only), so I am more interested in the rank of these 5
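One way to picture this is a post-filter applied to the ranked result URLs after the search has run. The sketch below is only that: `SiteFilterSketch` and `keepSites` are hypothetical names, not part of Nutch, and it assumes the result URLs are already available as plain strings in rank order.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: post-filters ranked result URLs by host.
class SiteFilterSketch {

    // Keeps the URLs whose host equals, or is a subdomain of, an allowed site.
    // The input order (the original ranking) is preserved.
    static List<String> keepSites(List<String> rankedUrls, Set<String> allowedSites) {
        List<String> kept = new ArrayList<>();
        for (String url : rankedUrls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip URLs that do not parse
            }
            if (host == null) {
                continue; // no host component, nothing to match against
            }
            host = host.toLowerCase(Locale.ROOT);
            for (String site : allowedSites) {
                if (host.equals(site) || host.endsWith("." + site)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> sites = new HashSet<>(Arrays.asList("site1.com", "site2.com"));
        List<String> ranked = Arrays.asList(
            "http://other.com/a", "http://www.site2.com/b", "http://site1.com/c");
        System.out.println(keepSites(ranked, sites));
    }
}
```

Filtering after the search keeps the relative order of the chosen sites but does not change their scores, so it only helps if the search returns enough hits to include them.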
Hi,
I am modifying Nutch 0.9 code for my project. Currently, I keep all my 0.9
code in my local main trunk. But I know that 1.0 will be out soon, and I want
to use the 1.0 code instead in the near future. What is the best way to set up svn to
do that? Should I just sync the main trunk from the Apache server
, dealmaker vin...@gmail.com wrote:
Hi,
I am modifying Nutch 0.9 code for my project. Currently, I put all my 0.9
code in my local main trunk. But I know that 1.0 will be out soon, and want
to use 1.0 code instead in near future. What is the best way to setup svn to
do that? Should I
merge
with the git version of nutch.
On Mon, Mar 2, 2009 at 9:27 AM, dealmaker vin...@gmail.com wrote:
No, it's not the official 1.0. Even so, there may be a 1.1 in the future. I just
want to know how to set up svn for future versions in a way that needs minimal
maintenance.
Thanks.
Tony Wang-3 wrote
And also, do you clone the main trunk, or just, for example, 0.9?
Dingding Ye wrote:
I have used git-svn to clone the Nutch project.
I then use a git repo to manage my personal version and periodically merge
with the git version of Nutch.
On Mon, Mar 2, 2009 at 9:27 AM, dealmaker vin
svn. It helps merges go smoothly.
What I did before was to clone the main trunk. It should work for 0.9 as well.
However, if you make rapid changes to the sources, I think none of these will
help and you will have to resolve the conflicts yourself.
On Mon, Mar 2, 2009 at 11:55 AM, dealmaker vin...@gmail.com
I was modifying Nutch 0.9 and the following code worked fine in the NutchBean
class:
MoreLikeThis mlt = new MoreLikeThis(((IndexSearcher) searcher).getIndexReader());
org.apache.lucene.search.Query q = mlt.like(myHits.getHit(0).getIndexDocNo());
After I upgraded to the nightly build today, I
I just downloaded the nightly build, and it seems that Nutch no longer uses
the document number and instead uses a key to locate a document. Because of that, my
code with MoreLikeThis doesn't work anymore. How do I make the new Nutch
code work with MoreLikeThis?
The following was my old code:
I am using the Nutch nightly build #741 (Mar 3, 2009 4:01:53 AM). I am at
the final phase of crawling, following the tutorial on the Nutch.org website. I
ran the following command and got an exception in Hadoop. I double-checked
the folder paths in nutch-site.xml, and they are correct. I tried
I have a similar problem with nightly build #741 (Mar 3, 2009 4:01:53 AM).
What's wrong?
Log from Hadoop:
2009-03-04 14:30:31,531 WARN mapred.LocalJobRunner - job_local_0001
java.lang.IllegalArgumentException: it doesn't make sense to have a field
that is neither indexed nor stored
at
Glad I found this thread. I got the same problem in build #741. Could
someone commit a fix to the trunk so we can have a working nightly
build?
Thanks.
tigertail wrote:
Andrzej,
I checked the latest SVN version and I faced the same problem. As it has
been a long time I do not
Hi,
Because my system uses a carrot2 2.1 plugin, I need to find the
source code and binary of carrot2 2.1 so I can examine the source and do
further tweaking. Do you know where I can download version 2.1? I
don't have time to upgrade to 3.0 because the API changed in 3.0.
Most webpages have sections like navigation, header, left column for related
links, footer, etc. How can I prevent Nutch from returning search results
that contain keywords only in the non-main body of the page? e.g. keywords
can appear in navigation bar or footer, but they may not appear in
Hi,
Does Nutch or any plugin have template detection? It seems that
navigation and footer sections usually distort the ranking of search
results. Is there already an open source project or code that I can integrate
into Nutch to give it template detection?
Thanks.
Is there any substitute for template detection? Any easy hack,
ready-made plugin, or open source project that can improve the search
results to some degree without template detection?
Thanks.
Andrzej Bialecki wrote:
dealmaker wrote:
Hi,
Does Nutch or any plugin have
Hi,
During crawling/indexing, I want to do some additional processing
on the raw HTML Nutch just crawled, and I want to save additional custom
data based on the raw HTML for later retrieval. Should I save this
additional custom data to the crawlDB, the segment, or somewhere else? I need to
Hi,
I am trying to find out the encoding and format of the content stored in
the index. I modified the code in BasicIndexFilter.java to store the
content. But I need to know the encoding of the stored content, and the index
doesn't seem to store this information. I also need to know whether it's
it?
dealmaker wrote:
I am using the Nutch nightly build #741 (Mar 3, 2009 4:01:53 AM). I am
at the final phase of crawling, following the tutorial on the Nutch.org
website. I ran the following command and got an exception in Hadoop. I
double-checked the folder paths in nutch-site.xml
I modified the index code to make it store the HTML after fetching. But when
I use getContent() to get the HTML, it returns HTML entities like &lt;
instead of <. Is there any purpose in storing the HTML as HTML entities?
Can I just store the HTML as regular HTML, not HTML entities?
Thanks.
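If the stored text really does come back entity-escaped, converting it after retrieval might look roughly like the sketch below. It is an assumption-laden illustration: `EntityUnescapeSketch` and `unescapeBasic` are hypothetical names, not Nutch API, and only the five basic entities are handled.

```java
// Hypothetical helper, not part of Nutch: reverses the five basic HTML escapes.
class EntityUnescapeSketch {

    // Replaces &lt; &gt; &quot; &#39; and &amp; with their literal characters.
    // &amp; is handled last so that "&amp;lt;" becomes "&lt;" rather than "<".
    static String unescapeBasic(String escaped) {
        return escaped
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&#39;", "'")
            .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescapeBasic("&lt;a href=&quot;x.html&quot;&gt;link&lt;/a&gt;"));
    }
}
```

Named entities beyond these five (for example &nbsp;) and numeric character references would need a fuller decoder.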
Hi,
I am writing code to make Nutch fetch files referenced by relative paths in an HTML
page.
The URL of the webpage can be of the form
http://www.mysite.com/folder1/page.html or http://www.mysite.com/folder1
The path of the file can be of the form ../../image.jpg, or
http://www.example.com/image.jpg, or
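For the general idea of turning such references into absolute URLs, a minimal sketch using only the JDK's java.net.URI is shown here. `RelativeUrlSketch` and its `resolve` wrapper are hypothetical names for illustration, and this is not how Nutch itself resolves outlinks.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: resolves a reference against a page URL.
class RelativeUrlSketch {

    // Returns the absolute form of ref, interpreted relative to pageUrl.
    // An already absolute ref is returned unchanged.
    static String resolve(String pageUrl, String ref) {
        return URI.create(pageUrl).resolve(ref).toString();
    }

    public static void main(String[] args) {
        // Relative reference against a page inside a folder.
        System.out.println(resolve("http://www.mysite.com/folder1/page.html", "../image.jpg"));
        // Without a trailing slash, "folder1" is treated as a file name, not a folder.
        System.out.println(resolve("http://www.mysite.com/folder1", "image.jpg"));
        // With a trailing slash, the reference resolves inside the folder.
        System.out.println(resolve("http://www.mysite.com/folder1/", "image.jpg"));
        // Absolute references pass through unchanged.
        System.out.println(resolve("http://www.mysite.com/folder1/page.html",
                                   "http://www.example.com/image.jpg"));
    }
}
```

A reference such as ../../image.jpg can climb above the site root for a shallow page URL, and how the extra ".." segments are treated depends on the resolver, so that case is worth testing separately.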
Hi,
I am using the bean in a class without a servlet. How do I retrieve the bean with
the conf without using a servlet? I see the following code in
OpenSearchServlet.java. I need the conf to get the bean, but both NutchBean.get()
and NutchConfiguration.get() need a servlet. How do I work around that?
Thanks.