What if we ask it to download 1000 bytes from the beginning and the same 
amount from the end, and ignore the rest? I need this to index MP3 files, 
since their metadata sits either at the beginning or at the end. My goal is 
to keep Nutch from spending time downloading whole files.
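For context on why a head-plus-tail fetch would be enough for MP3s: ID3v2 tags sit at the very start of the file, and ID3v1 tags occupy exactly the last 128 bytes. The following is a minimal standalone Python sketch (not Nutch code, and `parse_id3v1` is a hypothetical helper) showing that the tail alone is enough to recover ID3v1 metadata:

```python
# Parse an ID3v1 tag from the tail of an MP3 file. ID3v1 occupies exactly
# the last 128 bytes: "TAG" marker, then title/artist/album (30 bytes each),
# year (4 bytes), comment (30 bytes), genre (1 byte).

def parse_id3v1(tail):
    """Return a dict of ID3v1 fields from the last 128 bytes, or None."""
    tag = tail[-128:]
    if len(tag) < 128 or tag[:3] != b"TAG":
        return None  # no ID3v1 tag present

    def field(start, length):
        # Fields are padded with NULs or spaces; ID3v1 text is Latin-1.
        return tag[start:start + length].rstrip(b"\x00 ").decode("latin-1")

    return {
        "title": field(3, 30),
        "artist": field(33, 30),
        "album": field(63, 30),
        "year": field(93, 4),
    }

# Build a synthetic 128-byte ID3v1 tag for demonstration.
fake_tag = (b"TAG"
            + b"My Song".ljust(30, b"\x00")
            + b"Some Artist".ljust(30, b"\x00")
            + b"Some Album".ljust(30, b"\x00")
            + b"2009"
            + b"\x00" * 31)  # comment (30) + genre (1)

print(parse_id3v1(b"...audio frames..." + fake_tag))
```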

Thanks.
A.


-----Original Message-----
From: Jasper Kamperman <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 10:56 pm
Subject: Re: what is needed to index for about 10000 domains

I don't think there is a standard way to tell Nutch to download the end of a 
file (since you have to download the beginning before you get to the end).

But if I understand what you want, you could write a custom IndexFilter that 
indexes only the tail of a file; see 
http://wiki.apache.org/nutch/WritingPluginExample-0.9 for a start on how to 
do this.
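The tail-only idea can be illustrated in a few lines of plain Python. This is not the Nutch IndexingFilter API; in a real plugin the same truncation would live inside the filter's processing method, and `tail_only` is a hypothetical helper:

```python
# Sketch: keep only the last `limit` characters of a document's extracted
# text before handing it to the indexer.
def tail_only(text, limit=1000):
    """Return at most the final `limit` characters of `text`."""
    return text[-limit:]

doc_text = "x" * 5000 + " useful trailing metadata"
indexed = tail_only(doc_text)
print(len(indexed))  # -> 1000
```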

On Mar 3, 2009, at 8:48 PM, [email protected] wrote:

>
> I never tried to test this configuration. What about asking nutch to
> download a certain number of bytes from the end of files?
>

> -----Original Message-----
> From: Jasper Kamperman <[email protected]>
> To: [email protected]
> Sent: Tue, 3 Mar 2009 8:32 pm
> Subject: Re: what is needed to index for about 10000 domains
>

> One reason to store content is if you want to show snippets in search
> results.
>
> Another reason is if you want to have a "cached" feature where you can
> give the user the page as it looked when you crawled it (it may since
> have disappeared).
>
> There is a way to tell nutch to look at only the beginning of a file;
> it's this section in your config file:

> <property>
>   <name>file.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>

> This is from the nutch-default.xml in 0.9; I don't know whether it has
> changed in 1.0.
>
> On Mar 3, 2009, at 8:27 PM, [email protected] wrote:

>> Hi,
>>
>> I also noticed that we can disable storing the content of pages, which
>> I use. I wonder why someone needs to store content? Also, in the case
>> of files, is there a way to tell nutch not to download the whole file
>> but, say, 1000 bytes from the beginning, and parse and index
>> information only in that part of the files?
>>
>> Thanks.
>> A.
>>

>> -----Original Message-----
>> From: yanky young <[email protected]>
>> To: [email protected]
>> Sent: Tue, 3 Mar 2009 6:41 pm
>> Subject: Re: what is needed to index for about 10000 domains
>>

>> Hi:
>>
>> Doug Cutting once wrote a wiki page about the hardware requirements of
>> nutch; you can check it out:
>>
>> http://wiki.apache.org/nutch/HardwareRequirements
>>
>> good luck
>>
>> yanky
>>
>> 2009/3/4 John Martyniak <[email protected]>
>>

>>> Regarding the machine: you could run it on anything; it all depends on
>>> what kind of performance you want. So yes, you could run it on the
>>> machine that you have, or on something like the Linux machine that I
>>> have. And the DSL connection should be fine; you just need to make
>>> sure that it stays up the whole time, because if not it will start
>>> erroring out, and you will have to re-fetch that whole segment, as
>>> there is no way to pick up from where you left off.
>>>
>>> The only reason that I merged the segments was that I had many of
>>> them, and I wanted to build a big one before I started creating new
>>> ones. Another advantage of merging is that you can use it to clear out
>>> unwanted URLs. For example, I had a bunch of .js files in there that I
>>> didn't want to have as part of the index, so I cleared them out.
>>>
>>> I used "bin/nutch mergesegs".

>>>
>>> Regarding merging the other parts: I have never used that, but I don't
>>> think it is necessary unless you have multiple linkdbs, etc.; in my
>>> case I do not.
>>>
>>> -John
>>>
>>> On Mar 3, 2009, at 7:14 PM, [email protected] wrote:

>>>> Hi,
>>>>
>>>> I will need to index all links in the domains then. Do you think a
>>>> Linux box like yours, with a DSL connection, is OK to index the
>>>> domains I have?
>>>>
>>>> Why only segments? I thought we needed to merge all the subfolders
>>>> under the crawl folder. What did you use to merge them?
>>>>
>>>> Thanks.
>>>> A.
>>>>

>>>> -----Original Message-----
>>>> From: John Martyniak <[email protected]>
>>>> To: [email protected]
>>>> Sent: Tue, 3 Mar 2009 3:21 pm
>>>> Subject: Re: what is needed to index for about 10000 domains
>>>>

>>>> Well, the way that nutch works is that you would inject your list of
>>>> domains into the DB, and that would be the starting point. Since
>>>> nutch uses a crawler, it would grab those pages, determine whether
>>>> there are any links on those pages, and then add them to the DB. So
>>>> the next time you generated your URLs to fetch, it would take your
>>>> original list plus the ones that it found to generate the new
>>>> segment.
>>>>
>>>> If you wanted to limit it to only pages contained on your 10000
>>>> domains, you could use the regex-urlfilter.txt file in the conf
>>>> directory to limit it to your list. But you would have to create a
>>>> regular expression for each one.
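Writing one rule per domain by hand doesn't scale to 10000 entries, but it can be scripted. A hedged sketch follows: the `+`/`-` prefix syntax (accept/reject, one regex per line, first match wins) is Nutch's regex-urlfilter format, while the exact anchoring pattern here is an illustrative choice, not taken from the thread:

```python
import re

def urlfilter_rules(domains):
    """Emit one accept rule per domain, plus a final reject-all rule."""
    rules = []
    for d in domains:
        # Anchor at the URL start, allow optional subdomains, and escape
        # dots so "example.com" can't match "exampleXcom".
        rules.append("+^https?://([a-z0-9-]+\\.)*" + re.escape(d) + "/")
    rules.append("-.")  # reject anything not matched above
    return "\n".join(rules)

print(urlfilter_rules(["example.com", "example.org"]))
```

The output could be redirected into conf/regex-urlfilter.txt, though with 10000 regexes the filter's per-URL matching cost is worth measuring before relying on it.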

>>>>
>>>> I am not familiar with the merge script on the wiki, but I have
>>>> merged segments before and it did work. That was on Linux, but I
>>>> don't think that should make a difference.
>>>>
>>>> -John
>>>>
>>>> On Mar 3, 2009, at 5:10 PM, [email protected] wrote:

>>>>> Hi,
>>>>>
>>>>> Thanks for the reply. I have only a list of those domains; I am not
>>>>> sure how many pages they have. Is a DSL connection sufficient to run
>>>>> nutch in my case? Did you run nutch for all of your pages at once,
>>>>> or separately for a given subset of them? By the way, yesterday I
>>>>> tried the merge shell script that we have on the wiki. It gave a lot
>>>>> of errors; I ran it on cygwin, though.
>>>>>
>>>>> Thanks.
>>>>> A.
>>>>>

>>>>> -----Original Message-----
>>>>> From: John Martyniak <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Tue, 3 Mar 2009 1:44 pm
>>>>> Subject: Re: what is needed to index for about 10000 domains
>>>>>

>>>>> I think that in order to answer that question, it is necessary to
>>>>> know how many total pages are being indexed.
>>>>>
>>>>> I currently have ~3.5 million pages indexed, and the segment
>>>>> directories are around 45GB. The response time is relatively fast.
>>>>>
>>>>> In the test site it is running on a dual-processor Dell 1850 with
>>>>> 3GB of RAM.
>>>>>
>>>>> -John
>>>>>
>>>>> On Mar 3, 2009, at 3:44 PM, [email protected] wrote:

>>>>>> Hello,
>>>>>>
>>>>>> I use nutch-0.9 and need to index about 10000 domains. I want to
>>>>>> know the minimum hardware and memory requirements.
>>>>>>
>>>>>> Thanks in advance.
>>>>>> Alex.



 
