Think about using the Google API.
However, the way to go could be:
+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using NekoHTML (see the first sketch below)
++ use XPath queries to extract your data
++ also check out GATE as a named-entity extraction tool, to extract names based on patterns and heuristics
++ write the names to a file
+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update a second, empty crawl db from that segment
+ remove the first segment and db
+ create a segment from your second db and fetch it (see the second sketch below)
Your second segment will then contain only the paper pages.
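
For the DOM/XPath part, a minimal sketch could look like the one below. It assumes NekoHTML's DOMParser and the JDK 5 XPath API; the sample HTML, the class name AuthorExtractor and the //SPAN[@class='author'] expression are made up, so adapt them to the real page structure. In your map-reduce job the HTML would of course come from the fetched content rather than a literal string.

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AuthorExtractor {

  public static void main(String[] args) throws Exception {
    // Stand-in page for illustration only; in the job this comes from
    // the fetched content.
    String html = "<html><body>"
        + "<h1 class=\"title\">Some Paper</h1>"
        + "<span class=\"author\">Jane Doe</span>"
        + "</body></html>";

    // NekoHTML builds a well-formed DOM even from sloppy HTML.
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new StringReader(html)));
    Document doc = parser.getDocument();

    // NekoHTML upper-cases element names by default, hence //SPAN.
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList authors = (NodeList) xpath.evaluate(
        "//SPAN[@class='author']", doc, XPathConstants.NODESET);

    for (int i = 0; i < authors.getLength(); i++) {
      System.out.println(authors.item(i).getTextContent());
    }
  }
}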
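
The crawl db / segment juggling could then look roughly like this on the command line. This is only a sketch assuming the 0.8 / map-reduce style tools; the directory names (crawldb1, crawldb2, query_urls, segments) are made up and the exact arguments may differ between Nutch versions.

bin/nutch inject crawldb1 query_urls      # query_urls holds the generated query URLs
bin/nutch generate crawldb1 segments
s1=`ls -d segments/2* | tail -1`          # the freshly generated segment
bin/nutch fetch $s1                       # fetch (with parsing) so outlinks are recorded
bin/nutch updatedb crawldb2 $s1           # fill the second, empty crawl db from the segment
rm -r crawldb1 $s1                        # drop the first db and segment
bin/nutch generate crawldb2 segments
s2=`ls -d segments/2* | tail -1`
bin/nutch fetch $s2                       # this segment should contain only the paper pages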
HTH
Stefan
On 30.05.2006 at 12:14, HellSpawn wrote:
I'm working on a search engine for my university, and they want me to
create a repository of scientific articles on the web :D
I read something about XPath for extracting exact parts from a document;
once that is done, building the query is very easy, but my doubts are
about how to insert all of this into the Nutch crawler...
Thank you