Re: updating retry interval

2008-06-19 Thread John Martyniak
Sent: Tuesday, June 17, 2008 6:19:51 PM Subject: updating retry interval Is there a way to update the retry interval for a specific URL? -Chris -- John Martyniak Before Dawn Solutions, Inc. 9457 S. University Blvd. #266 Highlands Ranch, CO 80126 o: 1-877-499-1562 x707 (Toll Free) c

Re: Question about Nutch crawling

2008-07-02 Thread John Martyniak
! NewGuyInNutch -- John Martyniak Before Dawn Solutions, Inc. 9457 S. University Blvd. #266 Highlands Ranch, CO 80126 o: 1-877-499-1562 x707 (Toll Free) c: 303-522-1756 e: [EMAIL PROTECTED] w: http://www.beforedawn.com

Differences between Nutch and Solr

2008-10-22 Thread John Martyniak
Is the main difference between Nutch and Solr that Solr doesn't have a spider, so that in order to use it you would have to spider the web yourself or with some other tool? -John

Is Nutch Still Active?

2008-10-22 Thread John Martyniak
Hi, I have been playing around with Nutch for a little while, and I see a ton of emails on the mailing lists, but there hasn't been a formal build in more than a year. Are there any plans? Is this project still being worked on? Any thoughts would be greatly appreciated. -John

Re: Is Nutch Still Active?

2008-10-22 Thread John Martyniak
Ronny, Thanks for the info. Do you know what the approximate timing for that is (days, weeks, months)? And also the feature set? -John On Oct 21, 2008, at 7:50 PM, RONNY wrote: Nutch is too young a project to die; the men are finalizing version 1.0. Ronny John Martyniak wrote: Hi, I

Re: Is Nutch Still Active?

2008-10-22 Thread John Martyniak
others. Not dead, just busy. Dennis John Martyniak wrote: Ronny, Thanks for the info. Do you know what the approximate timing for that is (days, weeks, months)? And also the feature set? -John On Oct 21, 2008, at 7:50 PM, RONNY wrote: Nutch is too young a project to die; the men

Additional URL Content

2008-10-30 Thread John Martyniak
Hello everyone, Part of the requirements for a site that I am working on is that I have some information in a DB and some in a nutch index. The nutch index obviously contains the indexed URLs, etc. However I also have a DB that contains the URLs and a bunch of other information about

Segment size and maintenance

2008-10-30 Thread John Martyniak
Hello everyone, I am building a big index with many segments, and I have a couple of questions regarding segment/index maintenance. Is there a practical limit to the size of segments? Right now I have several segments with around 50K links and several with 25K links so the total

site: ??

2008-10-30 Thread John Martyniak
Is there any way to have site: perform as it does on Google, so that if you put in site: it shows all of the pages for a given site that are in the index? Or is there another way to achieve the same functionality? -John

Re: site: ??

2008-10-30 Thread John Martyniak
+bcdtravel.mg+hitsPerPage=10lang=en Regards Ronny John Martyniak wrote: Is there any way to have site: perform as it does on Google, so that if you put in site: it shows all of the pages for a given site that are in the index? Or is there another way to achieve the same functionality? -John

Re: Parse nutch results

2008-11-04 Thread John Martyniak
? From: John Martyniak [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, November 4, 2008 21:05:08 Subject: Re: Parse nutch results Well, the Nutch OpenSearch util does have the ability to return results in RSS XML. Have you installed nutch and have

Re: Indexing News groups

2008-11-20 Thread John Martyniak
it directly to Solr for indexing. No need for intermediate DB, XML files, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: John Martyniak [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Thursday, November 20, 2008 4:12:10 PM

Re: Indexing News groups

2008-11-20 Thread John Martyniak
a mini newsgroup reader, mimic its behaviour, but then once you grab a post you could send it directly to Solr for indexing. No need for intermediate DB, XML files, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: John Martyniak

Re: How to effectively stop indexing javascript pages ending with .js

2008-12-02 Thread John Martyniak
That is good information, because I have the same issue: I don't want the .js files in the index. But what if you already have a bunch of .js files in your segments and want to remove them from the index/segments? Is there any way to effectively do that as well? -John On Dec 2,

Re: How to effectively stop indexing javascript pages ending with .js

2008-12-02 Thread John Martyniak
That will be awesome. Will there be a limit to the number of domains that can be included? -John On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote: I am in the process of writing a domain-urlfilter. It will allow fetching only from a list of top level domains. Should have a patch out

Re: How to effectively stop indexing javascript pages ending with .js

2008-12-02 Thread John Martyniak
That sounds like a good feature. Will this be in the 1.0 release? -John On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote: John Martyniak wrote: That will be awesome. Will there be a limit to the number of domains that can be included? Only on what can be stored in memory in a set. So
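
The idea Dennis describes in this thread (allow fetching only from a list of domains, limited only by what fits in an in-memory set) can be sketched roughly as follows. This is an illustration only, not the actual Nutch domain-urlfilter patch: the function names `load_domains` and `is_allowed`, the one-entry-per-line list format, and the `#` comment convention are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def load_domains(lines):
    """Build an in-memory set of allowed domains (hypothetical list format:
    one entry per line, blank lines and '#' comments ignored)."""
    allowed = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            allowed.add(entry)
    return allowed


def is_allowed(url, allowed):
    """Return True if the URL's host, or any parent domain of it, is in the set."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # For "www.example.com" check "www.example.com", "example.com", then "com".
    return any(".".join(parts[i:]) in allowed for i in range(len(parts)))


allowed = load_domains(["# allowed domains", "org", "example.com", ""])
print(is_allowed("http://www.example.com/page.html", allowed))
print(is_allowed("http://example.net/", allowed))
```

Because membership tests on a set take roughly constant time, the size of the list affects memory use far more than lookup speed, which matches the "only on what can be stored in memory" answer quoted above.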

Re: Fetching only unfetched URLs

2008-12-04 Thread John Martyniak
updated or refetched as well? Does the generate-fetch-update methodology mean running a new crawler and merging it with the older one? Thanks ian -- From: John Martyniak [EMAIL PROTECTED] Sent: Thursday, December 04, 2008 2:01 PM To: nutch-user@lucene.apache.org

Re: Crawl Timing_Please help

2009-01-02 Thread John Martyniak
Neil, That shouldn't take that long; when I do a 50K page crawl it takes a few hours. Have you tried it again, without the issues that you were talking about? To my knowledge you can't restart a crawl from where it left off, though it would be a cool feature. Also is this

Re: next nutch release

2009-01-04 Thread John Martyniak
Awesome. Thanks for the update. -John On Jan 4, 2009, at 8:48 AM, Dennis Kubes wrote: Soon. We are working hard on getting current patches committed and issues resolved. 1-2 weeks. Dennis Boris Shulman wrote: Does anyone know when the next Nutch release is expected?

Compiling from Source

2009-02-02 Thread John Martyniak
I am trying to compile Nutch from source for the first time. It is on a Mac running 10.5.6, and I just downloaded the source, so it is a new copy. I get the following error: [javac] /Users/jmartyniak/Projects/java/nutch-trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java:29:

Re: Compiling from Source

2009-02-02 Thread John Martyniak
PM, Doğacan Güney wrote: On Mon, Feb 2, 2009 at 10:08 PM, John Martyniak j...@beforedawn.com wrote: I am trying to compile Nutch from source for the first time. It is on a Mac running 10.5.6, and I just downloaded the source, so it is a new copy. I get the following error: [javac] /Users

Re: Compiling from Source

2009-02-03 Thread John Martyniak
, John Martyniak j...@beforedawn.com wrote: Thanks for the response. I was using 1.5 so I updated that to the following: java version 1.6.0_07 Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode) And I still get the exact

prioritizing urls and changing the re-fetch interval

2009-02-10 Thread John Martyniak
I would like to find out if there is any way, in either 0.9 or 1.0, to change the priority of URLs that need to be fetched and, on the same note, to change the re-fetch interval. I thought that it could be done by re-injecting the URL that you would like to re-fetch. But I had read recently

Keeping content fresh

2009-03-03 Thread John Martyniak
How does Nutch determine when content needs to be re-fetched? The way that I understand it, it uses the next fetch date, which is 7 days in the future. Is there any way to change that? Or to increase the fetching interval? Or somehow base it on how many times a piece of content is
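
The scheduling model described in this question (a page becomes due again once a fixed interval has passed since its last fetch) can be sketched as below. This is a minimal illustration, not Nutch's own scheduling code: the function names are hypothetical, and the 7-day default is simply the figure the poster mentions, not a verified Nutch setting.

```python
from datetime import datetime, timedelta


def next_fetch_time(last_fetch, interval_days=7):
    """Next fetch date = last fetch date + a fixed interval (assumed 7 days)."""
    return last_fetch + timedelta(days=interval_days)


def is_due(last_fetch, now, interval_days=7):
    """A URL is due for re-fetch once 'now' reaches its next fetch date."""
    return now >= next_fetch_time(last_fetch, interval_days)


last = datetime(2009, 3, 3)
print(next_fetch_time(last))
print(is_due(last, datetime(2009, 3, 9)))
print(is_due(last, datetime(2009, 3, 4), interval_days=1))
```

Passing a different `interval_days` per URL is one way to picture the per-page freshness the poster is asking about, with frequently changing sites given a shorter interval than the global default.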

Re: Keeping content fresh

2009-03-03 Thread John Martyniak
know that many people run generate, fetch, etc. loops very often to keep sites fresh. John Martyniak wrote: Justin, thanks for the info, this is very helpful. This value would apply to all pages though. I was thinking that if you have things like youtube.com, cnn.com, etc in your index

Re: Keeping content fresh

2009-03-03 Thread John Martyniak
didn't test that one so can't tell you how it works. I know that many people run generate, fetch, etc. loops very often to keep sites fresh. John Martyniak wrote: Justin, thanks for the info, this is very helpful. This value would apply to all pages though. I was thinking

Re: Keeping content fresh

2009-03-03 Thread John Martyniak
can try a build from last week; it's much better than 0.9. It looks like they should release rc1 in 2-3 weeks. John Martyniak wrote: Thank you, that sounds like a close match to what I am looking for. It looks like it is part of 1.0; I am only using 0.9 at this time. But I think I read that an RC1

Re: what is needed to index for about 10000 domains

2009-03-03 Thread John Martyniak
I think that in order to answer that question, it is necessary to know how many total pages are being indexed. I currently have ~3.5 million pages indexed, and the segment directories are around 45GB. The response time is relatively fast. In the test site it is running on a dual processor

Re: what is needed to index for about 10000 domains

2009-03-03 Thread John Martyniak
shell script that we have on wiki. It gave a lot of errors. I run it on cygwin though. Thanks. A. -Original Message- From: John Martyniak j...@beforedawn.com To: nutch-user@lucene.apache.org Sent: Tue, 3 Mar 2009 1:44 pm Subject: Re: what is needed to index for about 1

Re: what is needed to index for about 10000 domains

2009-03-03 Thread John Martyniak
to index the domains I have? Why only segments? I thought we need to merge all sub folders under crawl folder. What did you use for merging them? Thanks. A. -Original Message- From: John Martyniak j...@beforedawn.com To: nutch-user@lucene.apache.org Sent: Tue, 3 Mar 2009 3:21 pm

Re: The Future of Nutch

2009-03-13 Thread John Martyniak
Dennis, I am with you. I am building a large scale www search engine, but might also build a vertical search as well. Aren't the requirements for building a large scale www search the same as for building a vertical www search? The only thing that seems to change is the scope. I like

Re: The Future of Nutch

2009-03-14 Thread John Martyniak
I think that this would be the case for making Nutch a top level Apache Project. So that you can publish the framework and a complete app but still tie it all together. Because personally I think that is the strength of Nutch, that you can use it right out of the box, without

Merge taking forever

2009-06-03 Thread John Martyniak
, -John John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com

Re: Merge taking forever

2009-06-03 Thread John Martyniak
and are in good completed state, this merge would be very fast, because no sorting would be required. It would be very useful, too, because it seems that this simple use is what people need. Regards, Arkadi -Original Message- From: John Martyniak [mailto:j...@beforedawnsolutions.com

Re: Merge taking forever

2009-06-04 Thread John Martyniak
Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http

Re: Merge taking forever

2009-06-04 Thread John Martyniak
. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com John Martyniak President/CEO Before

Re: Merge taking forever

2009-06-04 Thread John Martyniak
Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c

Re: Merge taking forever

2009-06-04 Thread John Martyniak
Arkady, I think that is the beauty of Nutch. I have built an index of a little more than 6 million URLs with out-of-the-box Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines. -John On Jun 4, 2009, at 5:19 PM,

Re: Merge taking forever

2009-06-05 Thread John Martyniak
- 2009/6/5 John Martyniak j...@beforedawnsolutions.com Arkady, I think that is the beauty of Nutch. I have built an index of a little more than 6 million URLs with out-of-the-box Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines

Re: Merge taking forever

2009-06-06 Thread John Martyniak
jobs to 3, but when I restarted all it is still only using 2 and 1. Any ideas? I made that change in the hadoop-site.xml file BTW. -John On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote: John Martyniak wrote: Andrzej, I am a little embarrassed asking. But is there a setup guide

Re: Merge taking forever

2009-06-12 Thread John Martyniak
, John Martyniak j...@beforedawnsolutions.com wrote: OK, so an update to this item. I did start running Nutch with Hadoop; I am trying a single node config just to test it out. It took forever to get all of the files into the DFS, as it was just over 80GB, but it is in there. So I started

Re: Merge taking forever

2009-06-15 Thread John Martyniak
will fix it? Is there a good config guide for the distributed mode? 2009/6/12 Justin Yao jyaoj...@gmail.com Hi John, I have no idea about that either. Justin On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak j...@beforedawnsolutions.com wrote: Justin, Thanks for the response. I was having

Re: Merge taking forever

2009-06-17 Thread John Martyniak
Hi John, I have no idea about that either. Justin On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak j...@beforedawnsolutions.com wrote: Justin, Thanks for the response. I was having a similar issue, I was trying to merge the segments for crawls during the month of May probably

Nutch 0.20

2009-11-10 Thread John Martyniak
Hi everyone, Does anybody know of a good way to run Nutch on Hadoop 0.20.x, or whether it is possible at all? Thank you, -John

Nutch 0.19.2 and Ganglia 3.1.3

2009-11-17 Thread John Martyniak
Has anybody else had any trouble running nutch 0.19.2 with Ganglia 3.1.3? I was surfing through Jira and it seems that there were some issues but they have been resolved. Any thoughts would be helpful. Thank you, -John John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S

Re: Nutch 0.19.2 and Ganglia 3.1.3

2009-11-17 Thread John Martyniak
John Martyniak wrote: Has anybody else had any trouble running nutch 0.19.2 with Ganglia 3.1.3? I was surfing through Jira and it seems that there were some issues but they have been resolved. Any thoughts would be helpful. Thank you, -John John Martyniak President/CEO Before Dawn Solutions, Inc

Nutch upgrade to Hadoop

2009-11-19 Thread John Martyniak
Does anybody know of any concrete plans to update Nutch to Hadoop 0.20, 0.21? Something like a Nutch 1.1 release, get in some bug fixes and get current on Hadoop? I think that should be one of the goals. My 2 cents. -John

Re: Nutch upgrade to Hadoop

2009-11-20 Thread John Martyniak
That is good news. Thank you. I have been debating whether to update my cluster to 0.21...and now I can. -John On Nov 20, 2009, at 2:45 AM, Andrzej Bialecki a...@getopt.org wrote: John Martyniak wrote: Does anybody know of any concrete plans to update Nutch to Hadoop 0.20, 0.21

New version of nutch?

2010-03-03 Thread John Martyniak
Does anybody have an idea of when a new version of Nutch will be available, specifically one supporting the latest version of Hadoop, and possibly HBase? Thank you for any information. -John

Re: New version of nutch?

2010-03-03 Thread John Martyniak
Andrzej, Thanks for the information. I anxiously await the update. The HBase integration would be nice to have, but I don't think that it should hold up the release. -John On Wed, Mar 3, 2010 at 5:04 PM, Andrzej Bialecki a...@getopt.org wrote: On 2010-03-03 20:12, John Martyniak wrote