Re: Nutch robot hitting our Web servers
Hi,

Thank you for the email. Can you provide some more information? For example, how many requests does the bot make per second, does it respect robots.txt, etc.?

On Mon, Dec 13, 2010 at 11:28, Chrislip, Ric chrisl...@hartwick.edu wrote:

> For several days now a Nutch robot from IP 174.36.195.29 has been hitting our run-time Web servers. I noticed because our event logs are showing many ASP.NET warnings about illegal characters in path. Your Web page at http://nutch.apache.org/bot.htm says that you would like to hear about any bad behavior. I have attached today's log entries from that IP address on one of our servers.
>
> Ric Chrislip
> Senior Programmer/Analyst, E-mail Administrator
> Clark Hall 111, Hartwick College, Oneonta, New York, USA
> 607-431-4189

--
Doğacan Güney
Re: [Nutchbase] Multi-value ParseResult missing
Hey,

On Thu, Jul 22, 2010 at 00:47, Andrzej Bialecki a...@getopt.org wrote:

> Hi,
>
> I noticed that nutchbase doesn't use the multi-valued ParseResult; instead all parse plugins return a simple Parse. As a consequence, it's not possible to return multiple values from parsing a single WebPage, something that parsers for compound documents absolutely require (archives, rss, mbox, etc). Dogacan - was there a particular reason for this change?

No. Even though I wrote most of the original ParseResult code, I couldn't wrap my head around how to update the WebPage (or old TableRow) API to use ParseResult.

> However, a broader issue here is how to treat compound documents, and links to/from them:
>
> a) record all URLs of child documents (e.g. with the !/ notation, or # notation), and create as many WebPage-s as there were archive members. This needs some hacks to prevent such urls from being scheduled for fetching.
>
> b) extend WebPage to allow for multiple content sections and their names (and metadata, and ... yuck)
>
> c) like a) except put a special synthetic mark on the page to prevent selection of this page for generation and fetching. This mark would also help us to update / remove obsolete sub-documents when their container changes.
>
> I'm leaning towards c).

I was initially leaning towards (a) but I think (c) sounds good too. The nice thing about (c) is that these documents will correctly get inlinks (assuming the URL given to them makes sense - for an RSS feed, I am thinking this would be the link element), etc. Though this can also be a problem: in some instances you may want to refetch a URL that happens to be a link in a feed.

> Now, when it comes to the ParseResult ... it's not an ideal solution either, because it means we have to keep all sub-document results in memory. We could avoid it by implementing something that Aperture uses, which is a sub-crawler - a concept of a parser plugin for compound formats. The main plugin would return a special result code, which basically says "this is a compound format of type X", and then the caller (ParseUtil?) would use SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for the container. This parser in turn would simply extract sections of the compound document (as streams) and it would pass each stream to the regular parsing chain. The caller then needs to iterate over results returned from the SubCrawler. What do you think?

This is excellent :) +1.

> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com - Contact: info at sigram dot com

--
Doğacan Güney
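For the record, the sub-crawler idea might look roughly like this. SubCrawlerFactory.create() comes from the proposal above; the Section type and every method signature here are made up for illustration, not taken from Aperture or nutchbase:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;

    /** One member of a compound document (archive entry, feed item, ...). */
    class Section {
      final String name;       // e.g. "foo.zip!/readme.txt"
      final InputStream data;  // raw bytes, fed back to the regular parse chain
      Section(String name, InputStream data) { this.name = name; this.data = data; }
    }

    /** Extracts sections of a compound format without parsing them itself. */
    interface SubCrawler {
      Iterator<Section> sections() throws IOException;
    }

    class SubCrawlerFactory {
      /** Look up a SubCrawler registered for mime type typeX (illustration only). */
      static SubCrawler create(String typeX, InputStream containerDataStream) {
        throw new UnsupportedOperationException("illustration only");
      }
    }

    // The caller (ParseUtil?) would then do something like:
    //   SubCrawler sub = SubCrawlerFactory.create(typeX, containerDataStream);
    //   for (Iterator<Section> it = sub.sections(); it.hasNext(); ) {
    //     Section s = it.next();
    //     // pass s.data to the regular parsing chain, keyed by s.name
    //   }

The point of the shape: only one section's stream is open at a time, so sub-document results never all sit in memory the way a multi-valued ParseResult forces them to.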
Merging in nutchbase
Hey everyone,

I would like to start merging nutchbase into trunk, so I am hoping to get everyone's comments and suggestions on how to do that.

Some of the other changes in nutchbase (such as deleting nutch's own indexing system) have already been incorporated in nutch trunk, so I think the difference between nutchbase and nutch trunk has been reduced to the scope of NUTCH-650 and NUTCH-811, i.e., abstracting storage away from nutch.

Unfortunately, AFAICS, there is no easy way to separate NUTCH-650 into smaller patches. All nutch jobs and all plugins have to be updated to use the new (String, WebPage) API, and it needs to be done all at once. So if no one has any objections, I want to create a gigantic patch that applies to current trunk, attach it to NUTCH-650, and commit it soon. (I want to do this quickly because nutch development speed is picking up again, and I am worried that issues like NUTCH-843, while making perfect sense, will wreak havoc on my merging efforts :))

What does everyone think?

--
Doğacan Güney
Re: Merging in nutchbase
Hey everyone,

On Sat, Jul 10, 2010 at 17:43, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

> Hey Guys,
>
> +1 to Andrzej’s suggestion. I mostly run small scale stuff with Nutch, so unless I can run HBase in small scale (or better yet, an embedded SQL db), I won’t be as much use! :)

I just want to make clear that this is, indeed, a goal I share. Gora already has an SQL backend that can use embedded hsqldb. There are still some weird bugs (I really hate SQL :), but once I am done fixing them all (which I will be doing today and tomorrow), nutch will run on gora with embedded hsqldb with zero configuration.

> Cheers,
> Chris
>
> On 7/10/10 7:28 AM, Andrzej Bialecki a...@getopt.org wrote:
>
>> On 2010-07-10 15:24, Julien Nioche wrote:
>>
>>> I agree with Andrzej that the SQL backend has to be checked and tested on nutchbase before we can start porting it to the trunk. Moreover I have raised an important design issue on the list recently (table per fetchround) which needs some changes to Gora first and must be discussed, implemented and tested in NutchBase before we port it to trunk
>>
>> This could go either way, whichever is more convenient - I don't see it as something to necessarily withhold the merge. Without the first issue, though, we lose the ability to develop, test and run in local mode...
>>
>> --
>> Best regards,
>> Andrzej Bialecki
>> http://www.sigram.com - Contact: info at sigram dot com
>
> Chris Mattmann, Ph.D.
> Senior Computer Scientist, NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> Email: chris.mattm...@jpl.nasa.gov  WWW: http://sunset.usc.edu/~mattmann/
> Adjunct Assistant Professor, Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA

--
Doğacan Güney
Re: Merging in nutchbase
On Sat, Jul 10, 2010 at 17:28, Andrzej Bialecki a...@getopt.org wrote:

> On 2010-07-10 15:24, Julien Nioche wrote:
>
>> I agree with Andrzej that the SQL backend has to be checked and tested on nutchbase before we can start porting it to the trunk. Moreover I have raised an important design issue on the list recently (table per fetchround) which needs some changes to Gora first and must be discussed, implemented and tested in NutchBase before we port it to trunk
>
> This could go either way, whichever is more convenient - I don't see it as something to necessarily withhold the merge. Without the first issue, though, we lose the ability to develop, test and run in local mode...

While I agree with the table-per-fetchround issue, I would like to postpone it until after the merge. This issue is tricky for a couple of reasons. For example, AFAIK, cassandra's latest released version does not support live schema updates, so you cannot add/delete tables on a running cassandra machine. I guess we can use super columns as our tables, then use columns to store data, but that may be sub-optimal.

For SQL, as mentioned below, it is almost done. There is a weird bug where I do not read back what I just wrote. Once I figure out what's wrong, I think it will be good to go.

> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com - Contact: info at sigram dot com

--
Doğacan Güney
Re: [Nutchbase] WebPage class is a generated code?
Hey,

On Fri, Jul 2, 2010 at 17:26, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

> Hey Guys,
>
> Since they are generated, +1 to:
> - adding a filepattern to svn:ignore to ignore them
> - updating build.xml to autogenerate

I created NUTCH-842 to track this problem.

> Cheers,
> Chris
>
> On 7/2/10 3:24 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
>
>> (This question is mostly for Dogacan and Enis, but I encourage anyone familiar with the code to join the threads with [Nutchbase] - the sooner the better ;) )
>>
>> I'm looking at src/gora/webpage.avsc and WebPage.java & friends... presumably the java code was autogenerated from avsc using Gora? If so, we should put this autogeneration step in our build.xml. Or am I missing something?

Correct.

>> If we keep the generated java classes in svn then we probably want to make this task optional, i.e., it would not be done as part of the build tasks; OR we can add it to the build but remove it from svn (or better, add it to svn:ignore or whatever-it-is-called).
>>
>> J.
>
> Chris Mattmann, Ph.D.
> Senior Computer Scientist, NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> Email: chris.mattm...@jpl.nasa.gov  WWW: http://sunset.usc.edu/~mattmann/
> Adjunct Assistant Professor, Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA

--
Doğacan Güney
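For reference, the build.xml hook could look something like the target below. Whether Gora's schema compiler class is really named org.apache.gora.compiler.GoraCompiler, and what arguments it takes, are assumptions here rather than the task that eventually went into NUTCH-842:

    <!-- Hypothetical target: regenerate the Gora data beans from the Avro
         schema before compiling. Compiler class name and arguments are
         assumptions, not the committed NUTCH-842 task. -->
    <target name="generate-gora-src">
      <java classname="org.apache.gora.compiler.GoraCompiler"
            classpathref="classpath" fork="true" failonerror="true">
        <arg value="src/gora/webpage.avsc"/>
        <arg value="src/java"/> <!-- generated WebPage.java lands here -->
      </java>
    </target>

With something like this wired into the compile target's depends list, the generated classes can be dropped from svn (or svn:ignore'd) as Chris suggests.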
Minimizing the number of stored fields for Solr
Hey everyone,

This is not really a proposition but rather something I have been wondering about for a while, so I wanted to see what everyone is thinking.

Currently in our solr backend, we have stored=true/indexed=false fields and stored=true/indexed=true fields. The former class of fields is mostly used for storing the digest, caching information, etc. I suggest that we get rid of all indexed=false fields and read all such data from the storage backend instead.

For the latter class of fields (i.e., stored=true/indexed=true), I suggest that we set them to stored=false for everything but the id field. As an example, currently title is stored and indexed in solr while text is only indexed (and thus needs to be fetched from the storage backend). But for the hbase backend, title and text are already stored close together (in the same column family), so the performance hit of reading just text or reading both will likely be the same. And removing storage from solr may lead to better caching of indexed fields and better performance overall.

What does everyone think?

--
Doğacan Güney
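To make the read path concrete: with only the id field stored in solr, a field like text would be read back through Gora at serving time. A minimal sketch, assuming an already-open DataStore<String, WebPage>; package and class names follow later Apache Gora / Nutch 2.x releases and are assumptions here, not a fixed API:

    import org.apache.gora.store.DataStore;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.TableUtil;

    // Sketch: a solr hit returns only the id (the reversed URL); the
    // unstored 'text' field is then fetched from the storage backend.
    public class StoredFieldLookup {
      private final DataStore<String, WebPage> store;

      public StoredFieldLookup(DataStore<String, WebPage> store) {
        this.store = store;
      }

      /** Read only the 'text' column for a hit instead of storing it in solr. */
      public String textFor(String url) throws Exception {
        String key = TableUtil.reverseUrl(url);            // e.g. com.foo.bar:http/...
        WebPage page = store.get(key, new String[] { "text" });
        if (page == null) return null;
        CharSequence text = page.getText();
        return text == null ? null : text.toString();
      }
    }

Passing the field list to get() is what keeps this cheap: the backend only has to materialize the requested columns, which is why reading text alone costs about the same as reading title and text together from one column family.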
Nutchbase design doc
generator markers). GeneratorJob will also print the crawl id to the console.

c) FetcherJob: During the map phase, FetcherJob outputs <Integer, FetcherEntry(key, WebPage)> pairs. The first member of the pair (the integer) is a random integer between 0 and 65536. After map, these pairs are partitioned into reduces according to the URLs' hosts. Reduce works exactly like the old Fetcher's map. FetcherJob can also continue interrupted fetches now (by giving the -continue switch on the command line).

The random integer may seem pointless but it is actually quite important for performance. Let's say that instead of a random integer we were to just output <key, WebPage> pairs from map. Again, let's say you have 1000 URLs from host a.com and 500 URLs from b.com. Let's also say that we have 10 fetcher threads, so the maximum queue size is 500. In this case, since all URLs from host a.com will be processed by reduce before all URLs from b.com, during the reduce phase only one thread will fetch URLs from a.com while every other thread will be spin-waiting. However, with randomization, URLs from a.com and b.com will be processed in a random order, so bandwidth utilization will be higher.

d) ParserJob: ParserJob is straightforward. It is only a map (i.e., has 0 reducers). It simply parses all URLs with the active parse plugins.

e) DbUpdaterJob: This is a combination of the updatedb and invertlinks jobs. If a URL is successfully parsed (which means it will contain a parse marker), DbUpdaterJob will put its own marker. Note: It may make more sense to put a marker even if a URL is not successfully parsed. DbUpdaterJob also cleans all other markers.

f) IndexerJob: Goes over all URLs with a db update marker (again, you can specify ALL URLs with update markers, or a crawl id), and indexes them.

4) What's missing

Most of the core functionality and plugins have been ported. However, some tools and features are still missing: arc segment tools, PageRank scoring, field indexing API, etc.

--
Doğacan Güney
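The randomization trick in (c) is just a salted shuffle key. A minimal sketch of the map side, assuming a FetcherEntry wrapper around (key, WebPage) as described above; the class names are approximate, not the exact nutchbase classes:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map side of the fetch job as described above: emit a random shuffle key
    // so URLs from one big host don't arrive at a reducer as one long run.
    // WebPage and FetcherEntry are assumed to come from the nutchbase code;
    // FetcherEntry must be Writable so it can cross the shuffle.
    public class FetcherMapper
        extends Mapper<String, WebPage, IntWritable, FetcherEntry> {

      private final Random random = new Random();

      @Override
      protected void map(String key, WebPage page, Context context)
          throws IOException, InterruptedException {
        // The partitioner still groups by host; the random key only
        // randomizes the *order* within each reducer's input.
        context.write(new IntWritable(random.nextInt(65536)),
                      new FetcherEntry(key, page));
      }
    }

Since partitioning is by host, politeness is unaffected; only the interleaving of hosts within a reduce changes, which is exactly what keeps all 10 threads busy in the a.com/b.com example.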
Re: Nutchbase design doc
Hi Alex,

On Sat, Jul 3, 2010 at 14:45, Alex McLintock alex.mclint...@gmail.com wrote:

> Doğacan
>
> 2010/7/3 Doğacan Güney doga...@gmail.com:
>> I am attaching the first draft of a complete nutchbase design document. There are parts missing and parts not yet explained clearly but I would like to get everyone's opinion on what they think so far.
>
> Thanks. I read your design and found it quite clear - at least to this non-committer. :-) I would suggest that we should take this opportunity to do a full design document including design which has not changed from v1 to v2. So more please! I have made the odd comment which was then explained later on. I guess that meant I was a bit confused :-)

Thanks for the excellent comments. I will try to explain as best as I can. Please let me know which parts are unclear, which parts make no sense, etc., and I will improve the draft.

> The main thing I missed was any kind of overview of the data flow. I'd like to see a description of how a url/webpage goes through our system from being unknown, injected or discovered, to being queued for fetching in the generator, to fetched, parsed, fetchscheduled, scored, and generated again. Plus of course, indexed by sending to solr and being seen by an end user application. At each stage I'd like to see where the data is stored (text file, hbase, solr) and especially how this differs from the previous (text file, nutch crawldb, solr). I know that some of this may sound like a tutorial, but it is worth doing now rather than putting it off until later.

One of the things nutchbase attempts to do is hide all the complexity of managing individual segments, crawl/link/whatever dbs from the user. Now, nutch delegates all storage handling to Gora (http://github.com/enis/gora). What Gora does is give you a key-value store (in this case, keys are reversed URLs, values are WebPage objects), and you do all your work through these objects. So storage will not be an issue for you. Right now, Gora (and thus nutch) supports storing your data in hbase and sql (with cassandra and other backends coming soon). So with nutch and gora, you will start up your hbase/sql/cassandra/etc server(s), then nutch will figure out what to store and where.

>> Nutchbase
>>
>> 1) Rationale
>>
>> * All your data in a central location (at least, nutch gives you the illusion of a centralized storage)
>
> But hbase is distributed across your hadoop cluster, right? This is the illusion you meant.

Yes. Also, cassandra will be distributed too. Maybe in the future someone will write an HDFS-backed backend for Gora; then your data will actually live in separate files, but will still look like one centralized storage to you.

>> 2) Design
>>
>> As mentioned above, all data for a URL is stored in a WebPage object. This object is accessed by a key that is the reverse form of a URL. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b
>
> This was clear and is the main point to convey :-) I would in fact like loads more info on the WebPage object.

WebPage contains all data we have for a URL. Think Content + Parse Text + Parse Data + Crawl Datum + Outlinks + Inlinks...

>> If URLs are stored lexicographically, this means that URLs from the same domain and host are stored closer together. This will hopefully make developing statistics tools easier for hosts and domains.
>
> I am unconvinced by this. Yes we want host urls together so that we can easily do polite fetching from individual hosts. But would it make statistics tools easier? Maybe I don't know enough about hbase to be sure.

This is not about polite fetching. Let's say you want to count the number of fetched URLs from host foo.com. All you would have to do is execute a scan (in hbase lingo; in gora these are called queries) between the start of foo.com and the end of it. Since all URLs within a host are stored together, you do not have to go over the entire table to compute these statistics. Makes sense?

>> Writing a MapReduce job that uses Gora for storage does not take much effort.
>
> This was confusing me. I thought that using Gora meant that we were losing the benefits of hdfs. So if we run a map reduce job over machines which are also HBase nodes, does their input come from the hbase data stored on those nodes to reduce internal network traffic?

IIRC, we do that in Gora already. But even if we don't (which means we forgot to do it), using Gora means that you deal with straightforward java objects and Gora figures out what to store and where. As I said, your data can also be in sql, cassandra, etc. I guess part of the confusion is that the project was called nutchbase (hence implying it is about tying nutch into hbase). But it was just a stupid name I made up.
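That host scan is easiest to see as a key-range query. A sketch against the query API of later Apache Gora releases (setStartKey/setEndKey are assumptions about the exact names at the time); the key bounds are deliberately simplified, and a real tool would tighten them, since as written a host like foobar.com (key prefix com.foobar) would also fall inside the range:

    import org.apache.gora.query.Query;
    import org.apache.gora.query.Result;
    import org.apache.gora.store.DataStore;
    import org.apache.nutch.storage.WebPage;

    // Count pages for foo.com by scanning only its slice of the key space.
    // Keys are reversed URLs, so every com.foo.* row is contiguous.
    public class HostStats {
      public static long countPages(DataStore<String, WebPage> store)
          throws Exception {
        Query<String, WebPage> query = store.newQuery();
        query.setStartKey("com.foo");       // first possible key for foo.com
        query.setEndKey("com.foo\uffff");   // just past the last one (simplified)
        long count = 0;
        Result<String, WebPage> result = query.execute();
        while (result.next()) {
          count++;
        }
        result.close();
        return count;
      }
    }

The whole point is in the two set*Key calls: the backend touches only the rows in that range instead of the full table.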
Re: Nutchbase design doc
Hi Chris,

On Sat, Jul 3, 2010 at 18:35, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

> Guys,
>
> This sounds awesome. Even I could understand it, which is saying something! :) My only question: why introduce a new data structure called “Markers” when all that seems to be is a Metadata object. Let’s use o.a.tika.metadata.Metadata to represent that? My only comment then would be, aren’t we still doing something you mentioned you wanted to get rid of below, where you said: “For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata.” Aren’t we just doing the same thing with Markers?

Actually, markers used to be stored in the metadata object in WebPage (metadata is a map from string to bytes). It just seemed clearer to me to put them into their own field. We can discuss whether moving them back into metadata makes more sense. One thing: we can't use tika's Metadata object, as the WebPage object is generated from an avro schema.

As for your last comment: markers are only used to identify where we are in a crawl cycle, and the individual crawl ids. So during parse, when we get a URL during MapReduce, parse can easily check if that URL has been fetched in *that* crawl cycle (since there is no point in parsing it if it hasn't been fetched). So it is not used to pass any important information around. It is just a simple tracking system. Did this make it any clearer?

> Cheers,
> Chris
>
> On 7/3/10 3:01 AM, Doğacan Güney doga...@gmail.com wrote:
>
>> Hello everyone,
>>
>> I am attaching the first draft of a complete nutchbase design document. There are parts missing and parts not yet explained clearly, but I would like to get everyone's opinion on what they think so far. Please let me know which parts are unclear, which parts make no sense, etc., and I will improve the draft.
>>
>> Nutchbase
>>
>> 1) Rationale
>>
>> * All your data in a central location (at least, nutch gives you the illusion of a centralized storage)
>> * No more segment/crawldb/linkdb merges.
>> * No more missing data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary after nutchbase. When writing a job or a new plugin, the programmer only needs to specify which fields she wants to read and they will be available to the plugin / job.
>> * A much simpler data model. If you want to update a small part in a single record, now you have to write a MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new directory. With nutchbase, you can just update that record.
>>
>> 2) Design
>>
>> As mentioned above, all data for a URL is stored in a WebPage object. This object is accessed by a key that is the reverse form of a URL. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b
>>
>> If URLs are stored lexicographically, this means that URLs from the same domain and host are stored closer together. This will hopefully make developing statistics tools easier for hosts and domains.
>>
>> Writing a MapReduce job that uses Gora for storage does not take much effort. There is a new class called StorageUtils that has a number of static methods to make setting mappers/reducers/etc easier. Here is an example (from GeneratorJob.java):
>>
>>   Job job = new NutchJob(getConf(), "generate: " + crawlId);
>>   StorageUtils.initMapperJob(job, FIELDS, SelectorEntry.class,
>>       WebPage.class, GeneratorMapper.class, URLPartitioner.class);
>>   StorageUtils.initReducerJob(job, GeneratorReducer.class);
>>
>> An important argument is the second argument to #initMapperJob. This specifies all the fields that this job will be reading. If plugins will run during a job (for example, during ParserJob, several plugins will be active), then before the job is run, those plugins must be initialized and FieldPluggable#getFields must be called for all plugins to figure out which fields they want to read.
>>
>> During the map or reduce phase, modifying a WebPage object is as simple as using the built-in setters. All changes will be persisted. Even though some of these objects are still there, most of the CrawlDatum, Content, ParseData or similar objects are removed (or are slated to be removed). In most cases, plugins will simply take (String key, WebPage page) as arguments and modify the WebPage object in-place.
>>
>> 3) Jobs
>>
>> Nutchbase uses the concept of a marker to identify what has been processed and what will be processed from now on. The WebPage object contains a map<String, String> called markers. For example, when GeneratorJob generates a URL, it puts a unique string and a unique
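The fetch-marker check during parse then reduces to a map lookup. A sketch; the marker key string and the exact map types here are assumptions about the nutchbase code, not taken from it:

    import org.apache.avro.util.Utf8;
    import org.apache.nutch.storage.WebPage;

    // Sketch of the marker protocol described above: parse only touches rows
    // that carry this cycle's fetch marker. "_ftcmrk_" is an assumed key name.
    public class Markers {
      private static final Utf8 FETCH_MARK = new Utf8("_ftcmrk_");

      /** True if this page was fetched in the given crawl cycle. */
      public static boolean fetchedIn(WebPage page, String crawlId) {
        Utf8 value = page.getMarkers().get(FETCH_MARK); // the map<string,string> field
        return value != null && value.toString().equals(crawlId);
      }
    }

Because the marker value is the crawl id itself, the same lookup distinguishes "fetched ever" from "fetched in *this* cycle", which is all the tracking the jobs need.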
Re: Nutch 2.0
Hi,

On Tue, Jun 29, 2010 at 11:49, Julien Nioche lists.digitalpeb...@gmail.com wrote:

> Thanks Chris,
>
> I already shared my thoughts on this yesterday, but I still fail to see the advantage of keeping the details of the recent github nutchbase commits (some of them being just upgrades to the recent changes in 1.1) in svn nutchbase, knowing that the point is actually to do incremental changes to the existing trunk (which already has the 1.1 changes) from svn nutchbase and review / comment / improve the code on this occasion. Since we also want to produce a patch in JIRA for the changes in svn nutchbase in order to put the "donated to Apache" stamp on it, it would make sense to do that just once and not for all the commits which have been done in github.
>
> I am probably missing an important point here, but if so I would appreciate it if someone (Dogacan?) could explain why we should not stick to the original plan: (a) clear the existing svn nutchbase, (b) generate a large patch with the code from github and JIRA it,

Do you mean generating a single patch vs nutch? There are a lot of fixes and improvements in nutch 1.1 that I cherry-picked to nutchbase later. If we generate a larger patch, and then this branch is blessed as trunk, then history for those improvements will be lost. Or am I misunderstanding you here?

> (c) commit the changes to svn nutchbase, then get on with the interesting bits. My concern is that proceeding as Dogacan described yesterday might take quite some time and block the rest of the work on 2.0. I am happy to work on the 3 steps above BTW.
>
> Thanks
> Julien
>
> On 29 June 2010 06:44, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>
>> Okey dokey guys, (c), (e) and (g) are done. Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e) and (f)...
>>
>> Cheers,
>> Chris
>>
>> On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:
>>
>>> On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
>>>
>>>> On 2010-06-28 17:57, Mattmann, Chris A (388J) wrote:
>>>>
>>>>> Hi Doğacan, So your proposition is to combine (a) and (b) then? That’s fine by me, so long as there are no objections from others. I can still move forward with (c), (e) and (g) then...
>>>>
>>>> No objections from me - but IMHO to satisfy the legal minds you still need to produce a patch and attach it to an issue with the "Grant to ASF" checkbox marked...
>>>
>>> OK, I'll create a new issue in JIRA, and then attach a lot of patches :) I'll try to appropriately mark patches that are straightforward ports from nutch 1.1 into nutchbase so that the same committers can commit those patches _again_, hopefully preserving post nutch 1.0 history as much as possible.
>>>
>>>> (Also, I always shudder when I imagine a massive merge failing ... but that's probably a leftover from my CVS days, when a failed merge would leave a completely broken tree.. ah, well, good luck :) )
>>>
>>> I regularly do large merges in git and it works beautifully. We'll see how well SVN does :)
>>
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist, NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
>> Email: chris.mattm...@jpl.nasa.gov  WWW: http://sunset.usc.edu/~mattmann/
>> Adjunct Assistant Professor, Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
Doğacan Güney
Re: Nutch 2.0
Hey all,

I will double check to make sure, but IIRC, there is no need to delete svn nutchbase since the current code on github simply builds on top of that. So why not simply merge the github branch into svn? It will be a clean merge... The only problem is that contributor info is messed up in github, but I tried to preserve as much contrib info as possible when I pulled in the 1.1 changes (via git cherry-pick). So we can break the code in github into smaller patches, apply them on top of svn nutchbase (which, again, will be clean), then the 1.1 changes can be applied by the _original_ committers, thus hopefully preserving contributor info as well. Makes sense?

On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com wrote:

> Hi,
>
>> (a) deleting svn:nutchbase
>> (b) svn:importing Git Nutchbase
>> (c) branch current 1.2-trunk as 1.2-branch
>> (d) iteratively apply patches from new svn:nutchbase to trunk to bring it up to snuff
>> (e) roll the version # in nutch trunk to 2.0-dev
>> (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where it makes sense
>> (g) a 2.1 version is added to mark anything that we don't want in 2.0 and we file post-2.0 issues there
>> (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is removed. All unit tests should pass regression where it makes sense
>> (i) Nutch documentation is brought up to date on wiki and checked into SVN
>> (j) We roll a 2.0 release
>
> +1
>
>> I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to participate in (d) and (f). I'm thinking Julien and Doğacan would be the best people to do (b) and (i). Doğacan is in the process of writing the documentation. (h) should be a result of all steps prior, (a)-(g), and as for (j), I'd be happy to do (j) when the time comes. So, if I don't hear any objections, I'll do (a), (c), (e) and (g) tomorrow... (6/28, likely PM PST Los Angeles time)
>
> cool, thanks
>
> J.
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
Doğacan Güney