Hi
I would start from a different angle. Crawling such sites is not an easy task; your
parser will have to be very smart.
I would first check whether your websites expose a public API that could be used to
run searches, and then aggregate the results into one set.
Best Regards
Alexander Aristov
On 12 September 2011
Yes, this is true (assuming that by 'compar' you are referring to
'compare'), but it sounds like you are after some kind of similarity
analysis.
As Markus said, this is not the fetcher's job but belongs to another element of the
Nutch architecture.
In general... I would ask you to understand how
Hi Markus,
Please correct me if I'm wrong, but isn't there a document signature check
to detect whether a page contains the same content as another page that has
already been parsed and indexed?
Dinçer
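(For reference, the signature check Dinçer mentions is configured in nutch-site.xml. A sketch, assuming the stock signature implementations that ship with Nutch 1.x; check the class names against your version:)

```xml
<!-- nutch-site.xml: choose the signature implementation used to detect
     pages with identical content. MD5Signature hashes the raw content;
     TextProfileSignature is more tolerant of near-duplicates. -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.MD5Signature</value>
</property>
```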
2011/9/12 Markus Jelsma markus.jel...@openindex.io
Hi,
Would it not be a good idea to patch DomContentUtils
Hi Lewis,
My fault, sorry for that.
I had enabled some of the logging for httpclient. Please find the attached log
file.
Please let me know if you need more information on this.
http://lucene.472066.n3.nabble.com/file/n3332184/hadoop.log hadoop.log
On Tuesday 13 September 2011 13:12:41 Alexander Aristov wrote:
Yes, you can stop, but how do you know whether a URL is good or not?
You can use a URL filter to discard unwanted URLs.
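(As an illustration, URL filtering is usually done via conf/regex-urlfilter.txt. A sketch only; the patterns and the example domain are hypothetical:)

```
# regex-urlfilter.txt: rules are applied top-down; the first match wins.
# Skip URLs carrying session-id style query parameters (example pattern)
-[?&]sessionid=
# Accept only pages under a chosen host (example.com is a placeholder)
+^https?://([a-z0-9-]+\.)*example\.com/
# Reject everything else
-.
```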
We see that many sites with relative URLs and no base href produce erroneous
links. As with the example there is a
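(To illustrate the failure mode: a minimal sketch using plain java.net.URL resolution; the URLs are made up. Without a `<base href>`, relative outlinks resolve against the page's own directory, which can silently duplicate path segments.)

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeLinkDemo {
    public static void main(String[] args) throws MalformedURLException {
        // Page fetched from a deep path; no <base href> is present,
        // so relative outlinks resolve against the page URL itself.
        URL page = new URL("http://example.com/articles/2011/index.html");
        URL outlink = new URL(page, "2011/next.html");
        System.out.println(outlink);
        // -> http://example.com/articles/2011/2011/next.html
        // A site that meant "/articles/2011/next.html" but relied on a
        // <base href="/articles/"> now yields a duplicated path segment.
    }
}
```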
Hi,
Another complaint about Nutch's handling of outlinks. Since NUTCH-436 there is
better support for embedded segment parameters. This exotic feature, however,
causes a lot of invalid outlinks to be generated.
For some reason (most likely bad webmasters, as in my other thread) I see a lot
of
Committed in 1.4 per https://issues.apache.org/jira/browse/NUTCH-1110
Thanks for pointing me to Nutch-1029. This will fix it. It seems
completely Hadoop-related though, because Nutch tries to open map files
using purely Hadoop code. CrawlDbReader delegates it to
Comparing with Nutch 1.2, I do not see any content folder under the segment
directories. Does this mean that we cannot set store.content to false in Nutch 1.3?
Thanks.
Alex.
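(For anyone checking: content storage is toggled in nutch-site.xml. A sketch; the property name fetcher.store.content is assumed from the Nutch 1.x configuration, so verify it against your release's nutch-default.xml:)

```xml
<!-- nutch-site.xml: whether fetched page content is written to the
     segment's content/ directory (assumed 1.x property name) -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```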
--
View this message in context:
http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3334709.html
Hello
I want to search over articles via Solr, so I need to find only specific file
types like doc, docx, and pdf.
I don't need any HTML pages; thus the search results should consist only
of doc, docx, and pdf files.
I'm using Nutch to crawl web pages and send Nutch's data to Solr for
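(One common approach is to restrict URLs by extension in conf/regex-urlfilter.txt. A sketch under stated assumptions, not a complete recipe; note the caveat in the comments:)

```
# regex-urlfilter.txt (sketch): accept only doc, docx and pdf URLs.
# Caveat: HTML pages usually have to be fetched anyway so the crawler can
# discover links to the documents, so in practice this filtering is often
# done at indexing time rather than at fetch time.
+\.(doc|docx|pdf)$
-.
```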