Re: [ANNOUNCE] Apache Nutch 1.0
Dennis,

Thanks a lot.

-Ryan

2009/3/28 Tony Wang ivyt...@gmail.com

Hi Sami,

Thank you so much for the good news. Is there going to be documentation for Solr integration? Sorry to Otis, I know you are going to ask me to try to find it out by myself ;)

Thanks!

- Tony

On Sat, Mar 28, 2009 at 1:53 PM, Sami Siren ssi...@gmail.com wrote:

I am pleased to announce the availability of Apache Nutch 1.0.

Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats.

Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr Integration, new indexing framework and new scoring framework just to mention a few. Details can be found in the changes file:

http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt

Apache Nutch is available for download from the following download page:

http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz

When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:

http://www.apache.org/dist/lucene/nutch/KEYS

For more information on Apache Nutch, visit the project home page:

http://lucene.apache.org/nutch

--
Sami Siren (on behalf of the Apache Nutch community)

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信
Re: [ANNOUNCE] Apache Nutch 1.0
Is it possible to use heritrix as nutch's crawler?
Re: [ANNOUNCE] Apache Nutch 1.0
To a point yes. Heritrix will output in arc format. Then you can use the o.a.n.tools.arc.ArcSegmentsCreator to convert the arc files to segments. From there you can run other tools on the segments as normal. What you won't get is Heritrix access to the crawldb.

Dennis
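For orientation on the ARC input mentioned above: as far as I recall from the Internet Archive's ARC version 1 format, each record starts with a single header line of five space-separated fields (URL, IP address, 14-digit archive date, content type, length in bytes). The Python sketch below parses one such header line. It is only an illustration under that assumption: the names `ArcRecordHeader` and `parse_arc_header` and the sample values are hypothetical, and this is not code from Nutch or Heritrix, nor a description of how the converter class works internally.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Fields of an ARC v1 record header line (assumed five-field layout)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated ARC v1 record header line."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(
            f"expected 5 space-separated fields, got {len(fields)}"
        )
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # ARC v1 dates are written as YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


if __name__ == "__main__":
    # Made-up sample header, not taken from a real crawl.
    sample = "http://example.com/ 192.0.2.1 20090328175300 text/html 1234"
    print(parse_arc_header(sample))
```

A converter that turns ARC files into segments would need at least the URL, content type and body length of each record, which is why the header is a useful first thing to inspect when a conversion fails.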
Re: [ANNOUNCE] Apache Nutch 1.0
Dennis,

Thank you. OK, then one other question, please :). I want to use Heritrix together with hbase-writer, the Heritrix plugin that writes records directly to HBase: http://code.google.com/p/hbase-writer/ (HBase runs on top of Hadoop.)

Would it be feasible, and would it make sense, for someone (maybe myself) to write a new plugin for Nutch that reads its input data from HBase tables instead of ARC files?

Thanks again.

-Ryan

On Sat, Mar 28, 2009 at 5:22 PM, Dennis Kubes ku...@apache.org wrote:

To a point yes. Heritrix will output in arc format. Then you can use the o.a.n.tools.arc.ArcSegmentsCreator to convert the arc files to segments. From there you can run other tools on the segments as normal. What you won't get is Heritrix access to the crawldb.

Dennis
Re: [ANNOUNCE] Apache Nutch 1.0
That is already in the works. See: https://issues.apache.org/jira/browse/NUTCH-650

Dennis
Re: [ANNOUNCE] Apache Nutch 1.0
Hi Sami,

Thank you so much for the good news. Is there going to be documentation for Solr integration? Sorry to Otis, I know you are going to ask me to try to find it out by myself ;)

Thanks!

- Tony

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信