Re: [ANNOUNCE] Apache Nutch 1.0

2009-03-29 Thread Ryan Smith
Dennis, Thanks a lot.
-Ryan

2009/3/28 Tony Wang ivyt...@gmail.com

 Hi Sami,

 Thank you so much for the good news. Is there going to be documentation for
 Solr integration? Sorry to Otis, I know you are going to ask me to try to
 find it out by myself ;)

 Thanks! - Tony

 On Sat, Mar 28, 2009 at 1:53 PM, Sami Siren ssi...@gmail.com wrote:

  I am pleased to announce the availability of Apache Nutch 1.0.
 
  Apache Nutch, a subproject of Apache Lucene, is open source web-search
  software. It builds on Lucene Java, adding web-specific components such as
  a crawler, a link-graph database, and parsers for HTML and other document
  formats.
 
  Apache Nutch 1.0 contains a number of bug fixes and improvements, such as
  Solr integration, a new indexing framework, and a new scoring framework, to
  mention just a few. Details can be found in the changes file:
  http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt
 
  Apache Nutch is available for download from the following download page:
  http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz
 
  When downloading from a mirror site, please remember to verify the
  downloads using signatures found on the Apache site:
  http://www.apache.org/dist/lucene/nutch/KEYS
 
  For more information on Apache Nutch, visit the project home page:
  http://lucene.apache.org/nutch
 
  -- Sami Siren (on behalf of the Apache Nutch community)
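
The verification step mentioned in the announcement might look roughly like the sketch below. It only builds the GnuPG command lines and does not run them. The detached signature name (the archive name plus ".asc") and the local file names are assumptions, not something the announcement states; the helper name verification_commands is hypothetical.

```python
from typing import List

# URL of the KEYS file, taken from the announcement.
KEYS_URL = "http://www.apache.org/dist/lucene/nutch/KEYS"


def verification_commands(archive: str, keys_file: str = "KEYS") -> List[List[str]]:
    """Return the gpg invocations for checking a downloaded archive.

    Assumption: the detached signature sits next to the archive and is
    named "<archive>.asc". Adjust if the Apache site uses another name.
    """
    signature = archive + ".asc"
    return [
        # Import the release managers' public keys from the KEYS file.
        ["gpg", "--import", keys_file],
        # Check the detached signature against the downloaded archive.
        ["gpg", "--verify", signature, archive],
    ]


if __name__ == "__main__":
    for command in verification_commands("nutch-1.0.tar.gz"):
        print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) once the KEYS file and the signature have been fetched from the Apache site itself rather than from a mirror.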
 






Re: [ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Ryan Smith
Is it possible to use Heritrix as Nutch's crawler?





Re: [ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Dennis Kubes
To a point, yes.  Heritrix writes its output in ARC format.  You can then use
org.apache.nutch.tools.arc.ArcSegmentsCreator to convert the ARC files to
segments.  From there you can run other tools on the segments as normal.
What you won't get is crawldb access from Heritrix.


Dennis
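
A minimal sketch of the conversion step described in this reply, assuming that bin/nutch accepts a fully qualified class name as its command and that ArcSegmentsCreator takes an ARC input directory followed by a segments output directory. Both points should be checked against the Nutch 1.0 sources. The helper name and all paths are hypothetical, and the code only builds the command line without running it.

```python
import posixpath
from typing import List

# Fully qualified name of the tool mentioned in the reply (o.a.n = org.apache.nutch).
ARC_TOOL = "org.apache.nutch.tools.arc.ArcSegmentsCreator"


def arc_to_segments_command(nutch_home: str, arc_dir: str, segments_dir: str) -> List[str]:
    """Build the argv for converting Heritrix ARC files into Nutch segments.

    Assumption: the tool is launched as
        bin/nutch <class name> <arc files dir> <segments output dir>
    """
    launcher = posixpath.join(nutch_home, "bin", "nutch")
    return [launcher, ARC_TOOL, arc_dir, segments_dir]


if __name__ == "__main__":
    # Hypothetical locations; replace with your own install and data paths.
    print(" ".join(arc_to_segments_command("/opt/nutch", "/data/arcs", "/data/segments")))
```

The resulting segments can then be fed to the usual Nutch tools, as the reply notes. The crawldb is not updated by this step.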






Re: [ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Ryan Smith
Dennis,
Thank you.  OK, then one other question, please :).  I want to use Heritrix
together with hbase-writer, the Heritrix plugin that writes records directly
to HBase:
http://code.google.com/p/hbase-writer/
(HBase runs on top of Hadoop.)
Would it be feasible, and would it make sense, for someone (maybe myself) to
write a new plugin for Nutch that reads its input data from HBase tables
instead of ARC files?
Thanks again.
-Ryan






Re: [ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Dennis Kubes

That is already in the works.  See:

https://issues.apache.org/jira/browse/NUTCH-650

Dennis







Re: [ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Tony Wang
Hi Sami,

Thank you so much for the good news. Is there going to be documentation for
Solr integration? Sorry to Otis, I know you are going to ask me to try to
find it out by myself ;)

Thanks! - Tony





-- 
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信
~ ..~
 (oo)