Re: Reviving Nutch 0.7
Otis, Some time ago people on the list said that they are willing to at least maintain the Nutch 0.7 branch. As a committer (not very active recently) I volunteered to commit patches when they appear - I do not have enough time at the moment to do active coding. I have created a 0.7.3 release in JIRA so we can start looking at it. So - we are ready and willing to move Nutch 0.7 forward, but it looks like there is no interest at the moment. Regards, Piotr

On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts? Otis
Re: How to Become a Nutch Developer
On 1/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Well ... so far this process was very informal, because there were so few key developers that they more or less knew what needs to be done, and who is doing what. Hadoop follows a much stricter and formalized model, which we could adopt, since it apparently works well there. This should address the issue of notifying others that the work is started on this or that item.

My 2 cents :-) ... I like the way the Hadoop guys work! It is strict, but to my mind being structured/rigid brings more benefit for the newbie developer, because you can follow every issue from start to end, and all the comments in between. I have noticed that some of the mailing list questions/answers related to issues, for example, are not in the Nutch JIRA, so to follow an issue you have to go back and forth between the mailing list and JIRA. IMHO Nutch should adopt the Hadoop model; furthermore, it is probably a good idea to discuss this further, because soon Nutch will have a 0.9 release and that is probably a good time to change to the Hadoop style :-) Just some thoughts. Cheers
Re: Reviving Nutch 0.7
On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts?

I agree with you that there is a need for 0.7-style Nutch. I wouldn't say reviving, but more dissecting and re-directing :-). Here you go --- my focus here is the 0.7 style, i.e. the mid-size, enterprise need. Solr could use a good crawler, because it has everything else (AFAIK) - though this is probably not technically plug and pray :-). Also, I am not sure the Solr community wants a crawler, but it could benefit from such a Solr add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins could be re-factored to fit into Solr. I will forward the mail to the Solr community to see if there is any interest. Cheers
RE: Reviving Nutch 0.7
Hello, I'm writing this on behalf of both Armel Nene and myself. We think that you and those who have responded have a point. We've been experiencing quite a number of problems with getting Nutch 0.8 adapted for our needs, and making changes to support evolving business requirements as they come up. So much so, that we've considered replacing the spine of Nutch with our own programs, which would still be compatible with the Nutch plugins (same parameters etc.), but which would make changing and debugging easier for us. We've decided to lay out some of our challenges for you to consider. Our major needs are the ability to deploy on large enterprise file systems (1-10 terabytes, large compared to average file systems, but small compared to the WWW). We also need to support http, but only for specific web sites, subscription web sites and so on. We don't need to replicate a generic-Google implementation. The main features we are currently working on relate primarily to near-real-time crawling, specifically:
- Incremental crawling, where changes are monitored at the folder level, which is much faster than fetching every URL and checking for a change. Note that this is similar to adaptive crawling, but will be even more efficient.
- Special handling for parsing of large files (possibly farming those out to dedicated processors a la Amazon). Hadoop would be useful here, but we would consider re-adding this at a later stage.
- Incremental indexing, where documents are added to or removed from a live index, instead of rebuilding a new index each time.
We would be happy to join a group of 0.7 developers, if that would enable us to pursue this enterprise-based direction, which clearly has different challenges than those facing WWW-crawling.
Best regards, Alan
Alan Tanaman, iDNA Solutions, http://blog.idna-solutions.com

-----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: 22 January 2007 06:48 To: Nutch Developer List Subject: Reviving Nutch 0.7 Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts? Otis
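The folder-level incremental crawling in Alan's list above can be sketched in a few lines. This is a hypothetical plain-Java illustration, not Nutch code (all class and method names are invented): each directory's last-modified time is compared against the value recorded on the previous scan, so only changed folders need their files re-fetched.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of folder-level incremental crawling: instead of re-fetching every
// file, compare each directory's last-modified time against the value seen on
// the previous scan and descend only into directories that changed.
// (Hypothetical illustration of the idea described above, not Nutch code.)
public class IncrementalScanner {
    private final Map<String, Long> lastSeen = new HashMap<>();

    // Returns the directories whose contents changed since the previous scan.
    public List<File> changedDirs(File root) {
        List<File> changed = new ArrayList<>();
        scan(root, changed);
        return changed;
    }

    private void scan(File dir, List<File> changed) {
        Long previous = lastSeen.get(dir.getPath());
        long current = dir.lastModified();
        if (previous == null || previous != current) {
            changed.add(dir); // new or modified folder: re-crawl its files
        }
        lastSeen.put(dir.getPath(), current);
        File[] children = dir.listFiles();
        if (children == null) return;
        for (File child : children) {
            if (child.isDirectory()) scan(child, changed);
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"), "inc-scan-" + System.nanoTime());
        tmp.mkdirs();
        IncrementalScanner scanner = new IncrementalScanner();
        System.out.println(scanner.changedDirs(tmp).size()); // first scan: everything is new
        System.out.println(scanner.changedDirs(tmp).size()); // second scan: nothing changed
    }
}
```

A real implementation would of course persist the `lastSeen` map between crawl cycles rather than keeping it in memory.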
Re: Fetcher2
Fetcher2 should be a great help for me, but it seems it can't be integrated with Nutch 0.8.1. Any advice on how to use it based on 0.8.1?

- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, January 18, 2007 5:18 AM Subject: Fetcher2 Hi all, I just committed a new implementation of the venerable fetcher, called Fetcher2. It uses a producer/consumers model with a set of per-host queues. Theoretically it should be able to achieve a much higher throughput, especially for fetchlists with a lot of contention (many urls from the same hosts). It should be possible to achieve the same fetching rate with a smaller number of threads, and most importantly to avoid the dreaded Exceeded http.max.delays: retry later error. It is available through bin/nutch fetch2. From the javadoc: A queue-based fetcher. This fetcher uses a well-known model of one producer (a QueueFeeder) and many consumers (FetcherThread-s). QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, which hold FetchItem-s that describe the items to be fetched. There are as many queues as there are unique hosts, but at any given time the total number of fetch items in all queues is less than a fixed number (currently set to a multiple of the number of threads). As items are consumed from the queues, the QueueFeeder continues to add new input items, so that their total count stays fixed (FetcherThread-s may also add new items to the queues, e.g. as a result of redirection) - until all input items are exhausted, at which point the number of items in the queues begins to decrease. When this number reaches 0 the fetcher will finish. This fetcher implementation handles per-host blocking itself, instead of delegating this work to protocol-specific plugins.
Each per-host queue handles its own politeness settings, such as the maximum number of concurrent requests and crawl delay between consecutive requests - and also a list of requests in progress, and the time the last request was finished. As FetcherThread-s ask for new items to be fetched, queues may return eligible items or null if for politeness reasons this host's queue is not yet ready. If there are still unfetched items on the queues, but none of the items are ready, FetcherThread-s will spin-wait until either some items become available, or a timeout is reached (at which point the Fetcher will abort, assuming the task is hung).

-- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
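The per-host politeness behaviour Andrzej describes can be modelled roughly like this. This is a simplified plain-Java sketch of the idea only, not the actual Fetcher2 code (names are invented, and in-progress request tracking and timeouts are omitted): a queue hands out an item only once the crawl delay has elapsed since the host's previous request finished.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Simplified model of Fetcher2-style per-host politeness: every host gets its
// own queue, and a queue only hands out an item once the configured crawl
// delay has elapsed since its previous request finished.
public class PoliteQueues {
    private final long crawlDelayMs;
    private final Map<String, Queue<String>> queues = new HashMap<>();
    private final Map<String, Long> lastFinished = new HashMap<>();

    public PoliteQueues(long crawlDelayMs) { this.crawlDelayMs = crawlDelayMs; }

    // Producer (QueueFeeder) side: route each URL into its host's queue.
    public void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Consumer (FetcherThread) side: return an eligible URL, or null if the
    // host is not yet ready for politeness reasons.
    public String next(String host, long now) {
        Long last = lastFinished.get(host);
        if (last != null && now - last < crawlDelayMs) return null;
        Queue<String> q = queues.get(host);
        return (q == null) ? null : q.poll();
    }

    // Called when a fetch completes, starting the host's delay window.
    public void finished(String host, long now) {
        lastFinished.put(host, now);
    }

    public static void main(String[] args) {
        PoliteQueues pq = new PoliteQueues(1000);
        pq.add("example.com", "http://example.com/a");
        pq.add("example.com", "http://example.com/b");
        System.out.println(pq.next("example.com", 0));    // first URL is eligible
        pq.finished("example.com", 0);
        System.out.println(pq.next("example.com", 500));  // null: still inside crawl delay
        System.out.println(pq.next("example.com", 1500)); // delay elapsed: second URL
    }
}
```

A thread that receives null would move on to other hosts or spin-wait and retry, exactly as described above.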
Re: Reviving Nutch 0.7
2007/1/22, Otis Gospodnetic [EMAIL PROTECTED]: Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts?

Before doubling (or after 0.9.0, tripling?) the maintenance/development work, please consider the following: one option would be refactoring the code so that the parts that are usable to other projects, like protocols and parsers (this was actually proposed by Jukka Zitting some time last year), would be made independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it would require a significant amount of work. The more focused, smaller chunks of Nutch would probably also get a bigger audience (perhaps also outside Nutch land) and that way perhaps more people willing to work on them. I don't know about others, but at least I would be more willing to work towards this goal than towards one where there would be practically many separate projects, each sharing common functionality but a different code base. -- Sami Siren
Re: Reviving Nutch 0.7
Before doubling (or after 0.9.0, tripling?) the maintenance/development work, please consider the following: one option would be refactoring the code so that the parts that are usable to other projects, like protocols and parsers (this was actually proposed by Jukka Zitting some time last year), would be made independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it would require a significant amount of work. The more focused, smaller chunks of Nutch would probably also get a bigger audience (perhaps also outside Nutch land) and that way perhaps more people willing to work on them. I don't know about others, but at least I would be more willing to work towards this goal than towards one where there would be practically many separate projects, each sharing common functionality but a different code base.

+1 ;) This was actually the project proposed by Jerome Charron and myself, called Tika. We went so far as to create a project proposal and send it out to the nutch-dev list, as well as the Lucene PMC, for potential Lucene sub-project goodness. I could probably dig up the proposal should the need arise. Good ol' Jukka then took that effort and created us a project within Google Code, which in fact still lives there: http://code.google.com/p/tika/ There hasn't been active development on it because:
1. None of us (I'm speaking for Jerome and myself here) ended up having the time to shepherd it going forward.
2. There was little, if any, response to the proposal on the nutch-dev list, and few folks willing to contribute (besides people like Jukka).
3. I think, as you correctly note above, most people thought it too much of a Herculean effort that wouldn't pay the necessary dividends in the end.
In any case, I think that if we are going to maintain separate branches of the source - in fact, really parallel projects - then an undertaking such as Tika is probably needed ... Cheers, Chris
Re: Reviving Nutch 0.7
Chris Mattmann wrote: In any case, I think that, if we are going to maintain separate branches of the source, in fact, really parallel projects, then an undertaking such as Tika is properly needed ...

I still don't think we need a separate project to start with; IMO the right frame of mind is enough to get going. If people think this is the right direction and it goes beyond talk, then perhaps after that we could start talking about a separate project. -- Sami Siren
Re: How to Become a Nutch Developer
Thanks to everyone for the input. I know some of these questions are obvious, but I wanted to take it from the lowest possible level. Part of the document is already posted to the wiki here: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer It seems like I am getting a section done each night, so everything should be done in a couple of days. Dennis Kubes

Chris Mattmann wrote: Hi Dennis, On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote: All, I am working on a How to Become a Nutch Developer document for the wiki and I need some input. I need an overview of how the process for JIRA works. If I am a developer new to Nutch and just starting to look at JIRA, and I want to start working on some piece of functionality or to help with bug fixes, where would I look? JIRA provides a lot of search facilities: it's actually kind of nice. The starting point for browsing bugs and other types of issues is: http://issues.apache.org/jira/browse/NUTCH (in general, for all Apache projects that use JIRA, you'll find that their issue tracking system boils down to: http://issues.apache.org/jira/browse/APACHE_PROJ_JIRA_ID ) From there, you can access canned filters for open issues like: Blocker, Critical, Major, Minor, Trivial. For more detailed search capabilities, click on the Find Issues button at the top breadcrumb bar. Search capabilities there include the ability to look for issues by developer, status, issue type, and to combine such fields using AND and OR. Additionally, you can issue a free text query across all issues by using the free text box there. Would I just choose something that is unscheduled and begin working on it? That's a good starting point: additionally, high priority issues marked as Blocker, Critical and Major are always good because the sooner we (the committers) get a patch for those, the sooner we'll be testing it for inclusion into the sources. What if I see something that I want to work on but it is scheduled to somebody else?
Walk five paces opposite your opponent: turn, then sho...err, wait. Nah, you don't have to do that. ;) Just speak up on the mailing list, and volunteer your support. One of the people listed in the group nutch-developers in JIRA (e.g., the committers) can reassign the issue to you so long as the other gent it was assigned to doesn't mind... Are items only scheduled to committers or can they be scheduled to developers as well? If they can be scheduled to regular developers, how does someone get their name on the list to be scheduled items? Items can be scheduled to folks listed in the nutch-developers group within JIRA. Most of these folks are the committers, however, not all of them are. I'm not entirely sure how folks get into that group (maybe Doug?), however, that's the real criterion for having a JIRA issue officially assigned to you. However, that doesn't mean that you can't work on things in the meantime. If there's an issue that you'd like to contribute to, please, prepare a patch, attach it to JIRA, and then speak up on the mailing list. Chances are, with the recent busy schedules of the committers (including myself) besides Sami and Andrzej, the committers don't have time to prepare patches for the issues assigned to them. If you contribute a great patch, the committer will pick it up, test it, apply it, and you'll get the same effect as if the issue were directly assigned to you. Should I submit a JIRA and/or notify the list before I start working on something? What is the common process for this? Yup, that's pretty much it. Voice your desire to work on a particular task on the nutch-dev list. Many of the developers on that list have been around for a while now, and they know what's been discussed, and implemented before. When I submit a JIRA is there anything else I need to do either in the JIRA system or with the mailing lists, committers, etc?
Nope: the nutch-dev list is automatically notified by all JIRA issue submissions, and the committers (and rest of the folks) will pick up on this and act accordingly. Getting this information together in one place will go a long way toward helping others to start contributing more and more. Thanks for all your input. No probs, glad to be of service :-) Cheers, Chris Dennis Kubes
Re: Fetcher2
chee wu wrote: Fetcher2 should be a great help for me,but seems can't integrate with Nutch81. Any advice on how to use it based on .81?

You would have to port it to Nutch 0.8.1 - e.g. change all Text occurrences to UTF8, and most likely make other changes too ... -- Best regards, Andrzej Bialecki
Re: How to Become a Nutch Developer
Dennis Kubes wrote: What does the Hadoop project do differently than Nutch? I thought they were both run about the same way? Is it that all communication on issues goes through JIRA?

The workflow is different - I'm not sure about the details, perhaps Doug can correct me if I'm wrong ... and yes, it uses JIRA extensively.
1. An issue is created.
2. Patches are added, removed, commented on, etc.
3. Finally, a candidate patch is selected, and the issue is marked Patch Available.
4. An automated process applies the patch to a temporary copy, and checks whether it compiles and passes the junit tests.
5. A list of patches in this state is available, and committers may pick from this list and apply them.
6. An explicit link is made between the issue and the change set committed to svn (is this automated?).
7. The issue is marked as Resolved, but not closed. I believe issues are closed only when a release is made, because issues in the Resolved state make up the changelog. I believe this is also automated.
-- Best regards, Andrzej Bialecki
Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer
On Jan 21, 2007, at 6:47 AM, Sami Siren wrote: However I cannot find from the change logs of hadoop that what the change is that is causing nutch these problems. It's HADOOP-331, so i guess at least the changes/additions in map() is required.

Hi, just following up here -- does this indicate that if I get a hadoop nightly that was patched for HADOOP-331 and have Nutch use it, the EOFException will go away in the latest nightlies? I tried that, it now crashes in a different spot, during fetching:

2007-01-22 11:34:53,051 INFO mapred.LocalJobRunner - 1 pages, 0 errors, 1.0 pages/s, 20 kb/s,
2007-01-22 11:34:53,134 WARN mapred.LocalJobRunner - job_yzavye
java.lang.NoSuchMethodError: org.apache.hadoop.io.MapFile$Writer.<init>(Lorg/apache/hadoop/fs/FileSystem;Ljava/lang/String;Ljava/lang/Class;Ljava/lang/Class;)V
    at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:58)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:303)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
2007-01-22 11:34:53,398 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:441)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
Re: How to Become a Nutch Developer
Andrzej Bialecki wrote: The workflow is different - I'm not sure about the details, perhaps Doug can correct me if I'm wrong ... and yes, it uses JIRA extensively. 1. An issue is created 2. patches are added, removed commented, etc... 3. finally, a candidate patch is selected, and the issue is marked Patch available. Patch Available is code for the contributor now believes this is ready to commit. Once a patch is in this state, a committer reviews it and either commits it or rejects it, changing the state of the issue back to Open. The set of issues in Patch Available thus forms a work queue for committers. We try not to let a patch sit in this state for more than a few days. 4. An automated process applies the patch to a temporary copy, and checks whether it compiles and passes junit tests. This is currently hosted by Yahoo!, run by Nigel Daley, but it wouldn't be hard to run this for Nutch on lucene.zones.apache.org, and I think Nigel would probably gladly share his scripts. This step saves committers time: if a patch doesn't pass unit tests, or has javadoc warnings, etc. this can be identified automatically. 5. A list of patches in this state is available, and committers may pick from this list and apply them. 6. An explicit link is made between the issue and the change set committed to svn (Is this automated?) Jira does this based on commit messages. Any bug ids mentioned in a commit message create links from that bug to the revision in subversion. Hadoop commit messages usually start with the bug id, e.g., HADOOP-1234. Remove a deadlock in the oscillation overthruster. 7. The issue is marked as Resolved, but not closed. I believe issues are closed only when a release is made, because issues in state resolved make up the Changelog. I believe this is also automated. Jira will put resolved issues into the release notes regardless of whether they're closed. The reason we close issues on release is to keep folks from re-opening them.
We want the release notes to be the list of changes in a release, so we don't want folks re-opening issues and having new commits made against them, since then the changes related to the issue will span multiple releases. If an issue is closed but there's still a problem, a new issue should be created linking to the prior issue, so that the new issue can be scheduled and tracked without modifying what should be a read-only release. Doug
Re: Reviving Nutch 0.7
[EMAIL PROTECTED] wrote: Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled. But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things should be implemented. In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc. That's the branch that's in the trunk. The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand or a few million or even a few tens of millions of pages, and don't need replication, etc. that comes with Hadoop. That branch could be based off of 0.7. I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher. Kelvin Tan brought this up a few times, too, I believe. Branching doesn't sound like the right solution here. First, one doesn't need to run any Hadoop daemons to use Nutch: everything should run fine in a single process by default. If there are bugs in this they should be logged, folks who care should submit high-quality, back-compatible, generally useful patches, and committers should work to get these patches committed to the trunk. Second, if there are to be two modes of operation, wouldn't they best be developed in a common source tree, so that they share as much as possible and diverge as little as possible? It seems to me that a good architecture would be to agree on a common high-level API, then use two different runtimes underneath, one to support distributed operation, and one to support standalone operation. Hey! That's what Hadoop already does! 
Maybe it's not perfect and someone can propose a better way to share maximal amounts of code, but the code split should probably be into different classes and packages in a single source tree maintained by a single community of developers, not by branching a single source tree in revision control and splitting the developers. Third, part of the problem seems to be that there are too few contributors -- that the challenges are big and the resources limited. Splitting the project will only spread those resources more thinly. What really is the issue here? Are good patches languishing? Are there patches that should be committed (meet coding standards, are back-compatible, generally useful, etc.) but are not? A great patch is one that a committer can commit with few worries: it includes new unit tests, it passes all existing unit tests, it fixes one thing only, etc. Such patches should not have to wait long for commit. And once someone submits a few such patches, then one should be invited to become a committer. It sounds to me like the problem is that, off-the-shelf, Nutch does not yet solve all the problems folks would like it to: e.g., it has never done a good job with incremental indexing. Folks see progress made on scalability, but really wish it were making more progress on incrementality or something else. But it's not going to make progress on incrementality without someone doing the work. A fork or a branch isn't going to do the work. I don't see any reason that the work cannot be done right now. It can be done incrementally: e.g., if the web db API seems inappropriate for incremental updates, then someone should submit a patch that provides an incremental web db API, updating the fetcher and indexer to use this. A design for this on the wiki would be a good place to start. Finally, web crawling, indexing and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages.
Distributed operation is soon required at this scale, and batch-mode is an order of magnitude faster. So be careful before you throw those features out: you might want them back soon. Doug
Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer
Brian Whitman wrote: On Jan 21, 2007, at 6:47 AM, Sami Siren wrote: However I cannot find from the change logs of hadoop that what the change is that is causing nutch these problems. It's HADOOP-331, so i guess at least the changes/additions in map() is required. Hi, just following up here-- does this indicate that if I get a hadoop nightly that was patched for HADOOP-331 and have Nutch use it, the EOFException will go away in the latest nightlies?

No, I mean that HADOOP-331 is the change that is _causing_ these, so we need to adapt the nutch code to cope with the change in sorting. Can somebody tell me why the various utilities (like Indexer) are doing the wrapping to ObjectWritable in InputFormat and not in Mapper.map in the first place? Is this an optimization of some kind? -- Sami Siren
Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer
Sami Siren wrote: Brian Whitman wrote: On Jan 21, 2007, at 6:47 AM, Sami Siren wrote: However I cannot find from the change logs of hadoop that what the change is that is causing nutch these problems. It's HADOOP-331, so i guess at least the changes/additions in map() is required. Hi, just following up here-- does this indicate that if I get a hadoop nightly that was patched for HADOOP-331 and have Nutch use it, the EOFException will go away in the latest nightlies? No, I mean that HADOOP-331 is the change that is _causing_ these, so we need to adapt the nutch code to cope with the change in sorting. Can somebody tell me why the various utilities (like Indexer) are doing the wrapping to ObjectWritable in InputFormat and not in Mapper.map in the first place? Is this an optimization of some kind?

This is a legacy from the (very recent) times when you had to set a key/value class of the InputFormat in your mapred job. You don't have to do this now - it's handled transparently by InputFormat.getRecordReader().createKey() and createValue(). In fact, there's a lot of this cruft left over in Nutch. We should also use GenericWritable in most of these places, and indeed we could wrap the values in Mapper.map(). -- Best regards, Andrzej Bialecki
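The GenericWritable idea Andrzej mentions - funneling several concrete value types through one wrapper type so a single job can pass heterogeneous values - can be illustrated with a stripped-down sketch. This is plain Java with invented names, not the Hadoop API: Hadoop's real GenericWritable additionally serializes a registered class index alongside the instance.

```java
// Stripped-down illustration of the GenericWritable idea: heterogeneous
// values share one wrapper type that remembers the concrete class, so a
// single reduce can receive them all under one declared value class.
// (Hypothetical sketch; not Hadoop's actual GenericWritable.)
public class GenericValue {
    private final Object instance;

    private GenericValue(Object instance) { this.instance = instance; }

    // A Mapper.map() would wrap each emitted value like this, instead of
    // having the InputFormat do the wrapping while reading input.
    public static GenericValue wrap(Object value) {
        return new GenericValue(value);
    }

    public Class<?> declaredClass() { return instance.getClass(); }

    public Object get() { return instance; }

    public static void main(String[] args) {
        GenericValue a = GenericValue.wrap("parse-text");
        GenericValue b = GenericValue.wrap(Integer.valueOf(42));
        System.out.println(a.declaredClass().getSimpleName()); // concrete type survives wrapping
        System.out.println(b.declaredClass().getSimpleName());
    }
}
```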
Re: How to Become a Nutch Developer
+1 for adopting the same types of process with Nutch. Doug Cutting wrote: Andrzej Bialecki wrote: The workflow is different - I'm not sure about the details, perhaps Doug can correct me if I'm wrong ... and yes, it uses JIRA extensively. 1. An issue is created 2. patches are added, removed commented, etc... 3. finally, a candidate patch is selected, and the issue is marked Patch available. Patch Available is code for the contributor now believes this is ready to commit. Once a patch is in this state, a committer reviews it and either commits it or rejects it, changing the state of the issue back to Open. The set of issues in Patch Available thus forms a work queue for committers. We try not to let a patch sit in this state for more than a few days. 4. An automated process applies the patch to a temporary copy, and checks whether it compiles and passes junit tests. This is currently hosted by Yahoo!, run by Nigel Daley, but it wouldn't be hard to run this for Nutch on lucene.zones.apache.org, and I think Nigel would probably gladly share his scripts. This step saves committers time: if a patch doesn't pass unit tests, or has javadoc warnings, etc. this can be identified automatically. 5. A list of patches in this state is available, and committers may pick from this list and apply them. 6. An explicit link is made between the issue and the change set committed to svn (Is this automated?) Jira does this based on commit messages. Any bug ids mentioned in a commit message create links from that bug to the revision in subversion. Hadoop commits messages usually start with the bug id, e.g., HADOOP-1234. Remove a deadlock in the oscillation overthruster. 7. The issue is marked as Resolved, but not closed. I believe issues are closed only when a release is made, because issues in state resolved make up the Changelog. I believe this is also automated. Jira will put resolved issues into the release notes regardless of whether they're closed. 
The reason we close issues on release is to keep folks from re-opening them. We want the release notes to be the list of changes in a release, so we don't want folks re-opening issues and having new commits made against them, since then the changes related to the issue will span multiple releases. If an issue is closed but there's still a problem, a new issue should be created linking to the prior issue, so that the new issue can be scheduled and tracked without modifying what should be a read-only release. Doug
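The issue-to-revision linking Doug describes in step 6 is done server-side by JIRA, but the core of it can be sketched with a small script. This is a minimal sketch, assuming only the `PROJECT-1234` id convention mentioned in the thread; the function name and regex are illustrative, not part of any actual JIRA integration:

```python
import re

# Matches JIRA issue ids such as HADOOP-1234 or NUTCH-42:
# an uppercase project key, a hyphen, and an issue number.
ISSUE_ID = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")

def issue_ids(commit_message):
    """Return all JIRA issue ids mentioned in a commit message."""
    return ISSUE_ID.findall(commit_message)

print(issue_ids("HADOOP-1234. Remove a deadlock in the oscillation overthruster."))
# → ['HADOOP-1234']
```

Because Hadoop commit messages start with the bug id, scanning each commit message this way is enough to build the bug-to-revision links that appear in JIRA.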
Re: Reviving Nutch 0.7
On 1/22/07, Doug Cutting [EMAIL PROTECTED] wrote:
Finally, web crawling, indexing and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages. Distributed operation is soon required at this scale, and batch-mode is an order of magnitude faster. So be careful before you throw those features out: you might want them back soon. Doug

As a developer building applications on top of Nutch, my experience is that I can't go back to version 0.7.x, because the features in 0.8/0.9 are much needed even for non-distributed crawling/indexing. For example, I can run crawling/indexing on a Linux server and a Windows laptop separately, and merge the newly crawled databases into the main crawldb. I remember v0.7 can't merge separate crawldbs without lots of customization.

It may take some time to switch from 0.7.x to 0.8/0.9, especially if you have lots of customization code. But once you get over this one hurdle, you will enjoy the new and better features in the 0.8/0.9 versions. Also, this may be the time to re-think the design of your application. For my own project, I always try to separate my code from the Nutch core code as much as possible, so that I can easily upgrade the application to keep up with new Nutch releases. Keeping away from the newest Nutch version seems backward to me.

AJ -- AJ Chen, PhD Palo Alto, CA http://web2express.org
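The crawldb merge AJ describes maps onto the 0.8 command line roughly as follows. This is a hedged sketch: `mergedb` is the CrawlDb merge command in Nutch 0.8+, but the directory names here are illustrative, not taken from the thread:

```shell
# Merge a crawldb built on a laptop into the main crawldb,
# writing the combined result to a new output directory.
# Usage (Nutch 0.8+): bin/nutch mergedb <output_crawldb> <crawldb1> [<crawldb2> ...]
bin/nutch mergedb crawl/merged_crawldb crawl/crawldb laptop_crawl/crawldb
```

In 0.7 this kind of merge required custom code; in 0.8/0.9 it is a single command over any number of separately-built databases.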
Re: How to Become a Nutch Developer
Doug, can you answer the question of how to add developer names to JIRA, or if that is only for committers? Dennis
Re: How to Become a Nutch Developer
Dennis Kubes wrote: Can you answer the question of how to add developer names to JIRA or if that is only for committers? It's not just for committers, but also for regular contributors. I have added you. Anyone else? Doug
Finished How to Become a Nutch Developer
All, a draft version of How to Become a Nutch Developer is on the wiki at: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer Please take a look, and if you think anything needs to be added, removed, or changed, let me know. Dennis Kubes