I am not able to make any Nutch query work. I suspect it is something
simple. Could someone take a look at what I am doing?
Here is the code I am using; it is pretty simple:
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("title:credit", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("hits.getLength()=>" + hits.getLength());
The configuration is exactly the same one I use to create the indexes;
it is the very same object. Pointing Luke at these indexes and issuing
the above search yields plenty of hits, yet this code yields none.
The other factor in the setup is nutch-site.xml. I added the following
property, which points to the root directory of my newly created indexes:
    <property>
      <name>searcher.dir</name>
      <value>outputDir</value>
      <description>
        Text removed from this email, remains in code
      </description>
    </property>
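One thing I have not ruled out is the directory layout. My understanding, which is an assumption I have not verified against the NutchBean source for my version, is that the bean looks under searcher.dir for an index or indexes subdirectory plus a segments subdirectory, and that a relative value such as outputDir is resolved against the working directory of the JVM. A minimal sketch of a check for that, using only java.nio (the class name SearcherDirCheck and its problems method are made up for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SearcherDirCheck {
    // ASSUMPTION: subdirectory names NutchBean is believed to look for under searcher.dir.
    static final String[] INDEX_DIRS = {"index", "indexes"};
    static final String SEGMENTS_DIR = "segments";

    // Returns a list of human-readable problems; an empty list means the layout looks plausible.
    static List<String> problems(Path searcherDir) {
        List<String> out = new ArrayList<>();
        // Resolve relative paths against the current working directory so the real location is visible.
        Path abs = searcherDir.toAbsolutePath().normalize();
        if (!Files.isDirectory(abs)) {
            out.add("not a directory: " + abs);
            return out;
        }
        boolean hasIndex = false;
        for (String name : INDEX_DIRS) {
            if (Files.isDirectory(abs.resolve(name))) {
                hasIndex = true;
            }
        }
        if (!hasIndex) {
            out.add("no index/ or indexes/ under " + abs);
        }
        if (!Files.isDirectory(abs.resolve(SEGMENTS_DIR))) {
            out.add("no segments/ under " + abs);
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults to the searcher.dir value from the property above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "outputDir");
        List<String> found = problems(dir);
        if (found.isEmpty()) {
            System.out.println("layout looks plausible: " + dir.toAbsolutePath().normalize());
        } else {
            for (String p : found) {
                System.out.println(p);
            }
        }
    }
}
```

Running this from the same working directory as the search code would at least show which absolute path outputDir ends up pointing at.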
The query returns zero hits. I have tried several things with no luck.
Can you help me out?
ray
On Mar 3, 2009, at 7:14 PM, [email protected] wrote:
>
> Hi,
>
> I will need to index all links in the domains then. Do you think a
> Linux box like yours with a DSL connection is OK for indexing the
> domains I have?
>
> Why only segments? I thought we needed to merge all subfolders under
> the crawl folder. What did you use for merging them?
>
> Thanks.
> A.
>
> -----Original Message-----
> From: John Martyniak <[email protected]>
> To: [email protected]
> Sent: Tue, 3 Mar 2009 3:21 pm
> Subject: Re: what is needed to index for about 10000 domains
>
> Well the way that nutch works is that you would inject your list of
> domains into the DB, and that would be the starting point. Since
> nutch uses a crawler it would grab those pages, and determine if
> there are any links on those pages, and then add them to the DB. So
> the next time that you generated your urls to fetch, it would take
> your original list, plus the ones that it found to generate the new
> segment.
>
> If you wanted to limit it to only pages contained on your 10000
> domains, you could use the regex-urlfilter.txt file in the conf
> directory to limit it to your list. But you would have to create a
> regular expression for each one.
>
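Inline note on the per-domain expressions: with 10000 domains the filter lines could be generated rather than written by hand. A rough sketch, assuming the usual regex-urlfilter.txt convention of a leading + (accept) or - (reject) followed by a regular expression, with the first matching line deciding; the class name UrlFilterLines and the sample domains are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UrlFilterLines {
    // Builds one accept ("+") rule per domain, then a final "-." rule that rejects everything else.
    static List<String> rules(List<String> domains) {
        List<String> out = new ArrayList<>();
        for (String domain : domains) {
            // Escape dots so "example.com" does not also match "exampleXcom".
            String escaped = domain.trim().toLowerCase().replace(".", "\\.");
            // Accept the bare domain and any subdomain, over http or https.
            out.add("+^https?://([a-z0-9-]+\\.)*" + escaped + "/");
        }
        out.add("-.");
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample domains; the real list would be read from the seed file.
        for (String line : rules(Arrays.asList("example.com", "example.org"))) {
            System.out.println(line);
        }
    }
}
```

The printed lines would go into conf/regex-urlfilter.txt in place of the default catch-all rule. Whether https and a trailing slash should be required depends on how the URLs are normalized, which I have not checked.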
> I am not familiar with the merge script on the wiki, but have merged
> segments before and it did work. But that was on Linux, don't think
> that should make a difference though.
>
> -John
>
> On Mar 3, 2009, at 5:10 PM, [email protected] wrote:
>
>> Hi,
>>
>> Thanks for the reply. I have list of those domains only. I am not
>> sure how many pages they have. Is a DSL connection sufficient to
>> run nutch in my case. Did you run nutch for all of your pages at
>> once or separately for a given subset of them. Btw, yesterday I
>> tried to use merge shell script that we have on wiki. It gave a lot
>> of errors. I run it on cygwin though.
>>
>> Thanks.
>> A.
>>
>> -----Original Message-----
>> From: John Martyniak <[email protected]>
>> To: [email protected]
>> Sent: Tue, 3 Mar 2009 1:44 pm
>> Subject: Re: what is needed to index for about 10000 domains
>>
>> I think that in order to answer that questions, it is necessary to
>> know how many total pages are being indexed.
>>
>> I currently have ~3.5 million pages indexed, and the segment
>> directories are around 45GB, The response time is relatively fast.
>>
>> In the test site it is running on a dual processor Dell 1850 with
>> 3GB of RAM.
>>
>> -John
>>
>> On Mar 3, 2009, at 3:44 PM, [email protected] wrote:
>>
>>> Hello,
>>>
>>> I use nutch-0.9 and need to index about 10000 domains. I want to
>>> know minimum requirements to hardware and memory.
>>>
>>> Thanks in advance.
>>> Alex.