Re: Nutch crawler issue with more depth value

2019-01-24 Thread Renato Marroquín Mogrovejo
Hi there, Can I ask you which backend you are using? If it is HBase, then you have to update the max KeyValue size configuration. This setting lives in the hbase-site.xml file and defaults to 10 MB: hbase.client.keyvalue.maxsize 10485760 I am copying the Gora mailing list as well, as they
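
For reference, the setting named above would look roughly like this inside the <configuration> element of hbase-site.xml. This is only a sketch: the 20 MB value is an illustrative assumption, not a figure from the thread, so pick a limit that fits the largest record you expect to store.

```xml
<!-- hbase-site.xml (sketch): raise the client-side KeyValue size limit -->
<property>
  <name>hbase.client.keyvalue.maxsize</name>
  <!-- 20971520 bytes = 20 MB (illustrative); the default is 10485760 (10 MB) -->
  <value>20971520</value>
</property>
```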

Re: [ANNOUNCE] Apache Gora 0.6.1 Release

2015-09-15 Thread Renato Marroquín Mogrovejo
Awesome news Lewis! Thanks for driving this! Renato M. 2015-09-15 8:26 GMT+02:00 lewis john mcgibbney : > Hi All, > > The Apache Gora team are pleased to announce the immediate availability of > Apache Gora 0.6.1. > > What is Gora? > Gora is a framework which provides an

Re: [VOTE] Release Apache Nutch 2.3

2015-01-10 Thread Renato Marroquín Mogrovejo
Tests pass and signature looks good. Here is my +1 (non-binding) Thanks for driving this Lewis! Renato M. 2015-01-09 9:58 GMT+01:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 2.3. Quite incredibly we addressed 143 issues

Re: Nutch works on Hadoop 2.5.2 with Hbase 0.98.8

2015-01-05 Thread Renato Marroquín Mogrovejo
Great work Talat! Please post your updated patches in the issues related to this (for the Gora side), and I will be more than happy to review them. Thanks a lot! Renato M. 2015-01-05 10:20 GMT+01:00 Talat Uyarer ta...@uyarer.com: Hi Folks, I selected this subject for taking attention. A lot of

Re: Potential Bug in 2.X HostDbUpdateReducer

2014-12-08 Thread Renato Marroquín Mogrovejo
Hi Lewis, From quickly checking out the code (Host.java + HostDB + HostDBUpdateReducer) it would seem like there is a bug exactly where you pointed. Renato M. 2014-12-08 20:53 GMT+01:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi Folks, I was looking into the code within Nutch 2.X

Re: Can't run Nutch2 on Hadoop2 (Nutch 2.x + Hadoop 2.4.0 + HBase 0.94.18 + Gora 0.5 + Avro 1.7.6)

2014-10-04 Thread Renato Marroquín Mogrovejo
Hi Alex, Just a quick question, why are you using Gora from com.argonio.gora? Why don't you use the Apache one, org.apache.gora? And could you tell us what exactly is going wrong? Renato M. 2014-10-04 14:03 GMT+02:00 k4200 k4...@kazu.tv: Hi Alex, But info about another experiences with

Re: Crawled data not inserting in the tables

2014-09-30 Thread Renato Marroquín Mogrovejo
Hi Kartik, If TTL hasn't been set or if it has been set to 0, then Gora is not using any TTL[1] and all your data should be persisted without any problems. Maybe this has something to do with the url generating/fetching process? Could you determine during which process the data is changing?

Re: unable to create new column families with Cassandra/Nutch

2014-09-10 Thread Renato Marroquín Mogrovejo
Hi Kartik, Could you please try adding this property to the gora.properties file? 2014-09-10 7:23 GMT+02:00 kkrishnanand kartik.krishnan...@bankofamerica.com : Hi, Nutch Gurus, I am trying to set up Nutch with Gora and Cassandra using the tutorial as follows. I am not having any

Re: unable to create new column families with Cassandra/Nutch

2014-09-10 Thread Renato Marroquín Mogrovejo
Sorry pressed send too soon. This is the property that I was talking about. gora.datastore.autocreateschema=true Thanks! Renato M. 2014-09-10 10:59 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com: Hi Kartik, Could you please try adding this property to gora.properties
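
As a small illustration of the reply above, the property would sit in gora.properties like this. Only the key and value come from the message; the comment describing its effect is my reading of the property name and should be checked against the Gora documentation.

```properties
# gora.properties (sketch)
# Presumably lets Gora create missing schemas (tables / column families) on first use.
gora.datastore.autocreateschema=true
```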

Re: unable to create new column families with Cassandra/Nutch

2014-09-10 Thread Renato Marroquín Mogrovejo
. . . . -Original Message- From: Renato Marroquín Mogrovejo [mailto:renatoj.marroq...@gmail.com] Sent: Wednesday, September 10, 2014 2:00 AM To: Nutch Users Subject: Re: unable to create new column families with Cassandra/Nutch Sorry pressed send too soon. This is the property that I

Re: New Apache Nutch Site

2014-06-11 Thread Renato Marroquín Mogrovejo
Cool job Lewis! 2014-06-11 11:32 GMT+02:00 Markus Jelsma markus.jel...@openindex.io: Awesome!!! -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent:Wed 11-06-2014 06:13 Subject:New Apache Nutch Site To:user@nutch.apache.org; d...@nutch.apache.org; Hi

Re: Reading from Hbase

2014-05-29 Thread Renato Marroquín Mogrovejo
Hi Murali, You could use Nutch 2.x as Julien told you which already uses Gora. This gives you some advantages like being able to read the data with MapReduce or directly from the data source you have chosen. Besides this, Gora integrates with Giraph which would give you the chance to run graph

Re: Nutch 2.x- Hbase - Solr Configuration

2014-05-13 Thread Renato Marroquín Mogrovejo
to gora.properties:gora.datastore.default=org.apache.gora.hbase.store.HBaseStore Thank you. David. On Tue, Apr 22, 2014 at 5:50 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi David, So where are you running this from? command-line? or eclipse? I

Re: [ANNOUNCEMENT] Apache Gora 0.4 Release

2014-04-23 Thread Renato Marroquín Mogrovejo
Great news for sure! Renato M. 2014-04-23 14:56 GMT+02:00 Julien Nioche lists.digitalpeb...@gmail.com: Great news! Well done and thanks to everyone involved. I am sure this will be popular with the Nutch 2.x users. BTW I can smell a rematch of

Re: Nutch 2.x- Hbase - Solr Configuration

2014-04-22 Thread Renato Marroquín Mogrovejo
Hi David, So where are you running this from? Command line or Eclipse? I think your classpath is missing the necessary files. Are you still getting the same exception as before, as if the changes you made had no effect? This is probably because the gora.properties file being picked up inside

Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer

2014-04-02 Thread Renato Marroquín Mogrovejo
Congrats Talat! Well deserved! Renato M. 2014-04-02 8:56 GMT+02:00 Talat Uyarer ta...@uyarer.com: Hi All, I am very excited now. :) Thanks a lot to everyone for inviting me. I'm a software engineer and crawler team leader of my company in Istanbul. I have been using Apache Nutch 2.X for

Re: Book of Nutch

2014-03-19 Thread Renato Marroquín Mogrovejo
Hey Talat, So what was the issue with this book? Renato M. 2014-03-18 10:21 GMT+01:00 Talat Uyarer ta...@uyarer.com: Hi All, Someone wrote a book about Nutch. I saw it in a Gora issue. http://www.packtpub.com/web-crawling-and-data-mining-with-apache-nutch/book -- Talat UYARER Website:

Re: Book of Nutch

2014-03-19 Thread Renato Marroquín Mogrovejo
: Getting Started with Data Mining Web Crawling or something, I am still trying to find time to review properly On Wed, Mar 19, 2014 at 8:52 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hey Talat, So what was the issue

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

2013-10-05 Thread Renato Marroquín Mogrovejo
Hi, if you are using a single server for production then you will be using a single server for HBase as well, right? So you should use standalone mode, as you will use the file system directly. Pseudo-distributed mode could be another option but would probably have more overhead and no advantage in

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

2013-10-05 Thread Renato Marroquín Mogrovejo
As Talat said, maybe 300k should work fairly OK. What is your hardware like? Are HBase and Nutch on the same server? Renato M. On Oct 5, 2013 10:08 PM, A Laxmi a.lakshmi...@gmail.com wrote: Thanks Talat! Renato - Thanks for your reply! I have tried Standalone mode but I have had lot of

Re: hBase + Nutch - timeout or session expiration while injecting

2013-09-23 Thread Renato Marroquín Mogrovejo
Hi Glumet, Does this [1] seem familiar? Renato M. [1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2?highlight=%28zookeeper.ClientCnxn%29 2013/9/21 glumet jan.bouch...@gmail.com Hello, I use hbase (0.90.4) as my storage for pages crawled by Nutch 2.2.1. Everything worked fine but

Re: HBase version recommended for Nutch 2.2.1

2013-08-30 Thread Renato Marroquín Mogrovejo
Hi Jonathan, A quick question about this. Did you run this out-of-the-box? Or did you make any modifications? I ask because HBase 0.92.2 uses Avro 1.5.3 and Gora uses Avro 1.3.3. Renato M. 2013/8/29 Jonathan.Wei 252637...@qq.com You can use hbase 0.92.2 and hadoop1.1.1! -- View this

Re: Re: HBase version recommended for Nutch 2.2.1

2013-08-30 Thread Renato Marroquín Mogrovejo
So no Gora? Renato M. 2013/8/30 基勇 252637...@qq.com I do not use Avro; I only use HBase 0.92.2 and Hadoop 1.1.1! I'm sorry I can't help you! -- Original message -- From: Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com; Sent: Saturday, August 31, 2013, 11:13 AM To: Nutch

Re: Re: HBase version recommended for Nutch 2.2.1

2013-08-30 Thread Renato Marroquín Mogrovejo
Thanks Jonathan! Renato M. 2013/8/30 基勇 252637...@qq.com I think you can try replacing the Avro version! Gora 0.3 uses HBase 0.90.4 by default. I replaced it with HBase 0.92.2 and it runs normally. You can try it! -- Original message -- From: Renato

Re: HBase version recommended for Nutch 2.2.1

2013-08-28 Thread Renato Marroquín Mogrovejo
Hi, It is 0.90.4. Renato M. 2013/8/28 A Laxmi a.lakshmi...@gmail.com What is a recommended (stable) HBase version that I can use for Nutch 2.2.1? Thanks for any help!

Re: [ANNOUNCE] Apache Nutch v1.7 Released

2013-06-27 Thread Renato Marroquín Mogrovejo
Great work Lewis! 2013/6/27 Julien Nioche lists.digitalpeb...@gmail.com Thanks Lewis for taking care of the release. Great stuff! Julien On 27 June 2013 00:38, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: N.B. Previous message doesn't seem to have been mod'd through under my

Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-06-27 Thread Renato Marroquín Mogrovejo
+1 (non-binding) 2013/6/27 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Hi, It would be greatly appreciated if you could take some time to VOTE on the release candidate for the Apache Nutch 2.2.1 artifacts. This candidate is (amongst other things) a bug fix for NUTCH-1591 - Incorrect

Re: [VOTE] Apache Nutch 2.2 Release Candidate

2013-06-03 Thread Renato Marroquín Mogrovejo
+1 (non-binding) On Jun 3, 2013 12:30 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: I just modded it through :) ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA

Re: Fetching a specific number of urls

2013-05-16 Thread Renato Marroquín Mogrovejo
/16 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com: Hi Tejas, Thank you very much for your help again. But I'm sorry to inform that I am still not able to get the next link into my crawldb. I am thinking that my conf/regex-urlfilter.txt file is not properly set up. I am sending

Re: Nutch 2.1 seed list

2013-05-15 Thread Renato Marroquín Mogrovejo
is stored in the metadata of current url, as part of the metadata of current url. On Fri, May 10, 2013 at 10:59 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi Feng, So this means I could put any type of information for the seed urls but what about the ones

Re: Nutch 2.1 seed list

2013-05-14 Thread Renato Marroquín Mogrovejo
url. On Fri, May 10, 2013 at 10:59 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi Feng, So this means I could put any type of information for the seed urls but what about the ones fetched in the next cycles? They won't have any of this information right? And where

Fetching a specific number of urls

2013-05-12 Thread Renato Marroquín Mogrovejo
Hi all, I have been trying to fetch a query similar to: http://www.xyz.com/?page=1 But where the number can vary from 1 to 100. Inside the first page there are links to the next ones. So I updated the conf/regex-urlfilter file and added: ^[0-9]{1,45}$ When I do this, the generate job fails
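
A rough way to see why that rule cannot help: as far as I know, regex-urlfilter rules are applied to the whole URL and each rule line must start with '+' (accept) or '-' (reject), so a bare pattern anchored to digits only can never match something like http://www.xyz.com/?page=1. The Python sketch below only imitates the matching with the standard re module (Nutch itself uses Java regexes). The PAGED pattern is a hypothetical replacement, not a rule taken from the thread.

```python
import re

# Pattern from the message above: anchored at both ends, digits only.
ORIGINAL = re.compile(r"^[0-9]{1,45}$")

# Hypothetical replacement: match the whole URL, with page limited to 1..100.
PAGED = re.compile(r"^http://www\.xyz\.com/\?page=([1-9][0-9]?|100)$")


def accepted(pattern, url):
    """Return True if the pattern is found in the URL (search semantics)."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in ("http://www.xyz.com/?page=1",
                "http://www.xyz.com/?page=100",
                "http://www.xyz.com/?page=101"):
        # Prints the URL, then the verdict of each pattern.
        print(url, accepted(ORIGINAL, url), accepted(PAGED, url))
```

In the filter file itself, a rule along these lines would presumably be written with a leading '+', for example +^http://www\.xyz\.com/\?page=([1-9][0-9]?|100)$, and placed before any catch-all reject rule.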

Re: Fetching a specific number of urls

2013-05-12 Thread Renato Marroquín Mogrovejo
, 2013 at 12:40 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi all, I have been trying to fetch a query similar to: http://www.xyz.com/?page=1 But where the number can vary from 1 to 100. Inside the first page there are links to the next ones. So I updated the conf

Re: Fetching a specific number of urls

2013-05-12 Thread Renato Marroquín Mogrovejo
And I did try the commands you told me but I am not sure how they work. They do wait for a url to be input, but then they print the url with a '+' at the beginning. What does that mean? http://www.xyz.com/lanchon +http://www.xyz.com/lanchon 2013/5/12 Renato Marroquín Mogrovejo renatoj.marroq

Re: Fetching a specific number of urls

2013-05-12 Thread Renato Marroquín Mogrovejo
rules accept one input url at a time from console (you need to type/paste the url and hit enter). It shows + if the url is accepted by the current rules. (- for rejection). Thanks, Tejas On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote
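
The '+' / '-' behaviour described above can be imitated with a few lines of Python. This is a sketch of the idea, not Nutch's code: the rules are hypothetical, and I am assuming that the first matching rule decides and that a URL matching no rule is rejected.

```python
import re

# Hypothetical rules in the "+pattern" / "-pattern" style of regex-urlfilter.txt.
RULES = [
    r"-\.(gif|jpg|png|css|js)$",
    r"+^http://www\.xyz\.com/",
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        parsed.append((sign == "+", re.compile(pattern)))
    return parsed


def check(url, rules):
    """Return '+url' if the first matching rule accepts it, else '-url'."""
    for accept, pattern in rules:
        if pattern.search(url):
            return ("+" if accept else "-") + url
    return "-" + url  # assumption: no matching rule means rejection


if __name__ == "__main__":
    rules = parse_rules(RULES)
    print(check("http://www.xyz.com/lanchon", rules))
    print(check("http://www.xyz.com/logo.png", rules))
```

Note that parse_rules refuses a line such as ^[0-9]{1,45}$ because it lacks the leading sign. That check is my own addition, meant to show why a bare pattern is not a usable rule.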

Re: Nutch 2.1 seed list

2013-05-10 Thread Renato Marroquín Mogrovejo
Hi Feng, So this means I could put any type of information for the seed urls but what about the ones fetched in the next cycles? They won't have any of this information right? And where is this information stored? As part of the fetched or the parsed information? Thanks! Renato M. On May 10,

Re: Using Nutch and Hive together

2013-05-01 Thread Renato Marroquín Mogrovejo
the firmest grasp of the internals. I'm not sure how useful I'll be at this moment. On Tue, Apr 30, 2013 at 6:21 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi Yves, Apache Gora does not support Apache Hive just yet, but we have it on our future plans. If you were

Re: Using Nutch and Hive together

2013-04-30 Thread Renato Marroquín Mogrovejo
Hi Yves, Apache Gora does not support Apache Hive just yet, but we have it in our future plans. If you were willing to dive into an adventure with Gora we would be happy to help you out with that. There is a Pig-Gora adapter patch on JIRA, maybe you would like to give it a look? Although there is

Re: Trying to output to db in MS-SQL on Azure

2013-04-16 Thread Renato Marroquín Mogrovejo
Hi all, I think the problem is the drivers Microsoft Azure requires. Right now Gora supports HDSQL but Azure needs some of the following clients [1], and right now we have no support in any of those clients :( Maybe you could try playing around with these clients inside Gora and give it a shot? I

Re: Something for the weekend

2013-02-28 Thread Renato Marroquín Mogrovejo
Cool! I will definitely use this to play with Apache Nutch. Thanks Lewis! Renato M. 2013/2/28 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi, I pushed a real simple script which I use as a cron job to bootstrap Apache Nutch with 1M URLs every day. For those wanting to crawl, test, use

Re: nutch-2.1 with hbase - any good tool for querying results?

2013-02-27 Thread Renato Marroquín Mogrovejo
Hi there, There are two options, and it will depend on how you would like to access the data. If you would like to access this using a data flow (Pig directly) you can find tons of information e.g. [1], but if you want to use a JDBC-like access, you can use Gora ;) Let us know what you feel like

Re: Slow parse on hadoop

2013-02-16 Thread Renato Marroquín Mogrovejo
Hi Tejas, 2013/2/16 Tejas Patil tejas.patil...@gmail.com: Hey Lewis, I am not knowledgeable about Gora thingy but am curious to know how parsing perf. might affect if one uses different storage. With Hbase it worked fine for OP but Cassandra gave this problem. Is the parsing code separate

Web pages parsed status

2012-12-08 Thread Renato Marroquín Mogrovejo
Hi all, I have started playing around with Apache Nutch, but I think this output is strange because I would actually like to fetch the content. Is there any configuration or a simple step I might be missing? baseUrl:null status: 1 (status_unfetched) fetchInterval: 2592000 fetchTime:

Re: Web pages parsed status

2012-12-08 Thread Renato Marroquín Mogrovejo
command/script or individually via a custom script? We advise against using the deprecated Crawl class in both distributions. Best Lewis On Sun, Dec 9, 2012 at 12:26 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi all, I have started playing around with Apache Nutch