Think about using the Google API.
However, the way to go could be:
+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using NekoHTML (see the first sketch below)
++ use XPath queries to extract your data
++ also check out GATE as a named-entity extraction tool, to extract names based on patterns and heuristics
++ write the names to a file
+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update a second, empty crawl db from that segment
+ remove the first segment and db
+ create a segment from your second db and fetch it (see the second sketch below)
Your second segment will then contain only the paper pages.
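
For the DOM/XPath part, a minimal sketch could look like the one below. It assumes NekoHTML's DOMParser and the JDK 5 XPath API; the sample HTML, the class name AuthorExtractor and the //SPAN[@class='author'] expression are made up, so adapt them to the real page structure. In your map-reduce job the HTML would of course come from the fetched content rather than a literal string.

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AuthorExtractor {

  public static void main(String[] args) throws Exception {
    // Stand-in page for illustration only; in the job this comes from
    // the fetched content.
    String html = "<html><body>"
        + "<h1 class=\"title\">Some Paper</h1>"
        + "<span class=\"author\">Jane Doe</span>"
        + "</body></html>";

    // NekoHTML builds a well-formed DOM even from sloppy HTML.
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new StringReader(html)));
    Document doc = parser.getDocument();

    // NekoHTML upper-cases element names by default, hence //SPAN.
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList authors = (NodeList) xpath.evaluate(
        "//SPAN[@class='author']", doc, XPathConstants.NODESET);

    for (int i = 0; i < authors.getLength(); i++) {
      System.out.println(authors.item(i).getTextContent());
    }
  }
}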
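
The crawl db / segment juggling could then look roughly like this on the command line. This is only a sketch assuming the 0.8 / map-reduce style tools; the directory names (crawldb1, crawldb2, query_urls, segments) are made up and the exact arguments may differ between Nutch versions.

bin/nutch inject crawldb1 query_urls      # query_urls holds the generated query URLs
bin/nutch generate crawldb1 segments
s1=`ls -d segments/2* | tail -1`          # the freshly generated segment
bin/nutch fetch $s1                       # fetch (with parsing) so outlinks are recorded
bin/nutch updatedb crawldb2 $s1           # fill the second, empty crawl db from the segment
rm -r crawldb1 $s1                        # drop the first db and segment
bin/nutch generate crawldb2 segments
s2=`ls -d segments/2* | tail -1`
bin/nutch fetch $s2                       # this segment should contain only the paper pages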
HTH
Stefan
On 30.05.2006 at 12:14, HellSpawn wrote:
I'm working on a search engine for my university, and they want me to
create a repository of scientific articles on the web :D
I read something about XPath for extracting exact parts from a document;
once that is done, building the query is very easy, but my doubts are
about how to insert all of this into the Nutch crawler...
Thank you