Hi there,
Can I ask you which backend you are using?
If it is HBase, then you have to update the max KeyValue size configuration.
This setting lives in the hbase-site.xml file and defaults to 10 MB:
hbase.client.keyvalue.maxsize
10485760
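As a rough sketch of the arithmetic and of what such an entry might look like: 10 MB is 10 * 1024 * 1024 = 10485760 bytes, which matches the value quoted here. The helper name `keyvalue_property` and the 50 MB figure are illustrative assumptions, not values from this thread.

```python
import xml.etree.ElementTree as ET

MB = 1024 * 1024  # bytes per megabyte (binary)


def keyvalue_property(max_mb: int) -> str:
    """Render a <property> element for hbase.client.keyvalue.maxsize.

    Hypothetical helper for illustration; max_mb is the desired limit in MB.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hbase.client.keyvalue.maxsize"
    ET.SubElement(prop, "value").text = str(max_mb * MB)
    return ET.tostring(prop, encoding="unicode")


default_bytes = 10 * MB        # the 10 MB default mentioned in the message
print(default_bytes)           # 10485760
print(keyvalue_property(50))   # 50 MB is an arbitrary example value
```

The printed element would go inside the `<configuration>` root of hbase-site.xml; pick a limit that fits the largest cell you expect to store.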
I am copying the Gora mailing list as well, as they
Awesome news Lewis! Thanks for driving this!
Renato M.
2015-09-15 8:26 GMT+02:00 lewis john mcgibbney :
> Hi All,
>
> The Apache Gora team are pleased to announce the immediate availability of
> Apache Gora 0.6.1.
>
> What is Gora?
> Gora is a framework which provides an
Tests pass and signature looks good.
Here is my +1 (non-binding). Thanks for driving this, Lewis!
Renato M.
2015-01-09 9:58 GMT+01:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi user@ dev@,
This thread is a VOTE for releasing Apache Nutch 2.3.
Quite incredibly we addressed 143 issues
Great work Talat! Please post your updated patches in the issues related to
this (for the Gora side), and I will be more than happy to review them.
Thanks a lot!
Renato M.
2015-01-05 10:20 GMT+01:00 Talat Uyarer ta...@uyarer.com:
Hi Folks,
I chose this subject to get attention. A lot of
Hi Lewis,
From quickly checking out the code (Host.java + HostDB +
HostDBUpdateReducer) it would seem like there is a bug exactly where you
pointed.
Renato M.
2014-12-08 20:53 GMT+01:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi Folks,
I was looking into the code within Nutch 2.X
Hi Alex,
Just a quick question, why are you using Gora from com.argonio.gora? Why
don't you use the Apache one? org.apache.gora? And could you tell us what
exactly is going wrong?
Renato M.
2014-10-04 14:03 GMT+02:00 k4200 k4...@kazu.tv:
Hi Alex,
But info about another experiences with
Hi Kartik,
If TTL hasn't been set or if it has been set to 0, then Gora is not using
any TTL[1] and all your data should be persisted without any problems.
Maybe this has something to do with the URL generating/fetching process?
Could you determine during which process the data is changing?
Hi Kartik,
Could you please try adding this property to the gora.properties file?
2014-09-10 7:23 GMT+02:00 kkrishnanand kartik.krishnan...@bankofamerica.com
:
Hi, Nutch Gurus,
I am trying to set up Nutch with Gora and Cassandra using the
tutorial as follows. I am not having any
Sorry pressed send too soon. This is the property that I was talking about.
gora.datastore.autocreateschema=true
Thanks!
Renato M.
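A quick way to confirm the property is actually present in the file being picked up is to parse it and inspect the value. This is a minimal sketch with a simplified parser (it does not cover the full Java .properties syntax), and the sample file contents are hypothetical apart from the autocreateschema line quoted above.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' or '!' comments.

    A simplified reader for illustration; it ignores line continuations,
    escapes and ':' separators that real .properties files allow.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Hypothetical file contents; only the property line comes from the thread.
sample = """
# gora.properties
gora.datastore.autocreateschema=true
"""

props = parse_properties(sample)
auto_create = props.get("gora.datastore.autocreateschema", "").lower() == "true"
print(auto_create)
```

To check a real file, read it with `open(path).read()` and pass the text to `parse_properties`.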
2014-09-10 10:59 GMT+02:00 Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com:
Hi Kartik,
Could you please try adding this property to gora.properties
. . . .
-Original Message-
From: Renato Marroquín Mogrovejo [mailto:renatoj.marroq...@gmail.com]
Sent: Wednesday, September 10, 2014 2:00 AM
To: Nutch Users
Subject: Re: unable to create new column families with Cassandra/Nutch
Sorry pressed send too soon. This is the property that I
Cool job Lewis!
2014-06-11 11:32 GMT+02:00 Markus Jelsma markus.jel...@openindex.io:
Awesome!!!
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent:Wed 11-06-2014 06:13
Subject:New Apache Nutch Site
To:user@nutch.apache.org; d...@nutch.apache.org;
Hi
Hi Murali,
As Julien told you, you could use Nutch 2.x, which already uses Gora. This
gives you some advantages like being able to read the data with MapReduce
or directly from the data source you have chosen. Besides this, Gora
integrates with Giraph which would give you the chance to run graph
to
gora.properties:gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Thank you.
David.
On Tue, Apr 22, 2014 at 5:50 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi David,
So where are you running this from? command-line? or eclipse? I
Great news for sure!
Renato M.
2014-04-23 14:56 GMT+02:00 Julien Nioche lists.digitalpeb...@gmail.com:
Great news! Well done and thanks to everyone involved. I am sure this will
be popular with the Nutch 2.x users.
BTW I can smell a rematch of
Hi David,
So where are you running this from? command-line? or eclipse? I think your
classpath is missing the necessary files.
Are you still getting the same exception as before? As if the changes you
made had no effect? This is probably because the gora.properties file being
picked up inside
Congrats Talat! Well deserved!
Renato M.
2014-04-02 8:56 GMT+02:00 Talat Uyarer ta...@uyarer.com:
Hi All,
I am very excited now. :) Thanks a lot to everyone for inviting me.
I'm a software engineer and crawler team leader of my company in
Istanbul. I have been using Apache Nutch 2.X for
Hey Talat,
So what was the issue with this book?
Renato M.
2014-03-18 10:21 GMT+01:00 Talat Uyarer ta...@uyarer.com:
Hi All,
Someone wrote a book about Nutch. I saw it in a Gora issue.
http://www.packtpub.com/web-crawling-and-data-mining-with-apache-nutch/book
--
Talat UYARER
Website:
: Getting Started with Data Mining Web Crawling
or something, I am still trying to find time to review properly
On Wed, Mar 19, 2014 at 8:52 AM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hey Talat,
So what was the issue
Hi,
If you are using a single server for production then you will be using a
single server for HBase as well, right? So you should use standalone mode as
you will use the file system directly. Pseudo distributed mode could be
another option but probably would have more overhead and no advantage in
As Talat said, maybe 300k should work fairly OK.
What is your hardware like? Are HBase and Nutch on the same server?
Renato M.
On Oct 5, 2013 10:08 PM, A Laxmi a.lakshmi...@gmail.com wrote:
Thanks Talat!
Renato - Thanks for your reply! I have tried Standalone mode but I have had
a lot of
Hi Glumet,
Does this [1] seem familiar?
Renato M.
[1]
http://wiki.apache.org/nutch/ErrorMessagesInNutch2?highlight=%28zookeeper.ClientCnxn%29
2013/9/21 glumet jan.bouch...@gmail.com
Hello,
I use hbase (0.90.4) as my storage for pages crawled by Nutch 2.2.1.
Everything worked fine but
Hi Jonathan,
A quick question about this. Did you run this out-of-the-box? Or did you do
any modifications?
I ask because HBase 0.92.2 uses Avro 1.5.3 and Gora uses Avro 1.3.3.
Renato M.
2013/8/29 Jonathan.Wei 252637...@qq.com
You can use HBase 0.92.2 and Hadoop 1.1.1!
--
View this
So no Gora?
Renato M.
2013/8/30 基勇 252637...@qq.com
I do not use Avro.
I only use HBase 0.92.2 and Hadoop 1.1.1!
I'm sorry I can't help you!
-- Original Message --
From: Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com;
Sent: Saturday, August 31, 2013, 11:13 AM
To: Nutch
Thanks Jonathan!
Renato M.
2013/8/30 基勇 252637...@qq.com
I think you can try replacing the Avro version!
Gora 0.3 uses HBase 0.90.4 by default. I replaced it with HBase 0.92.2 and it
operates normally.
You can try it!
-- Original Message --
From: Renato
Hi,
It is 0.90.4
Renato M.
2013/8/28 A Laxmi a.lakshmi...@gmail.com
What is a recommended (stable) *HBase* version that I can use for *Nutch
2.2.1*?
Thanks for any help!
Great work Lewis!
2013/6/27 Julien Nioche lists.digitalpeb...@gmail.com
Thanks Lewis for taking care of the release. Great stuff!
Julien
On 27 June 2013 00:38, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:
N.B. Previous message doesn't seem to have been mod'd through under my
+1 (non-binding)
2013/6/27 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Hi,
It would be greatly appreciated if you could take some time to VOTE on the
release candidate for the Apache Nutch 2.2.1 artifacts. This candidate is
(amongst other things) a bug fix for NUTCH-1591 - Incorrect
+1 (non-binding)
On Jun 3, 2013 12:30 AM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
I just modded it through :)
++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA
/16 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com:
Hi Tejas,
Thank you very much for your help again.
But I'm sorry to inform you that I am still not able to get the next link
into my crawldb. I am thinking that my conf/regex-urlfilter.txt file
is not properly set up. I am sending
is stored in the metadata of current url, as part of the
metadata of current url.
On Fri, May 10, 2013 at 10:59 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi Feng,
So this means I could put any type of information for the seed urls but
what about the ones
url.
On Fri, May 10, 2013 at 10:59 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi Feng,
So this means I could put any type of information for the seed urls but
what about the ones fetched in the next cycles? They won't have any of this
information right?
And where
Hi all,
I have been trying to fetch a query similar to:
http://www.xyz.com/?page=1
But where the number can vary from 1 to 100. Inside the first page
there are links to the next ones. So I updated the
conf/regex-urlfilter file and added:
^[0-9]{1,45}$
When I do this, the generate job fails
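One way to sanity-check a candidate pattern before editing the filter file is to try it against the URLs by hand. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Nutch runs on (for patterns this simple the two should behave the same), and the second pattern is a hypothetical alternative, not something proposed in the thread. It shows that a fully anchored `^[0-9]{1,45}$` can only match a string made entirely of digits, so it never matches a complete URL.

```python
import re

# The pattern from the message: anchored at both ends, digits only.
digits_only = re.compile(r"^[0-9]{1,45}$")

# Hypothetical alternative: match the whole URL, with a 1 to 3 digit page number.
paged_url = re.compile(r"^http://www\.xyz\.com/\?page=[0-9]{1,3}$")

urls = ["http://www.xyz.com/?page=1", "http://www.xyz.com/?page=100", "42"]

for url in urls:
    # Columns: input, matched by digits_only, matched by paged_url
    print(url, bool(digits_only.search(url)), bool(paged_url.search(url)))
```

Note that `[0-9]{1,3}` also admits page numbers up to 999; tighten it if only 1 to 100 should pass.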
, 2013 at 12:40 AM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi all,
I have been trying to fetch a query similar to:
http://www.xyz.com/?page=1
But where the number can vary from 1 to 100. Inside the first page
there are links to the next ones. So I updated the
conf
And I did try the commands you told me but I am not sure how they
work. They do wait for a URL to be input, but then it prints the URL
with a '+' at the beginning. What does that mean?
http://www.xyz.com/lanchon
+http://www.xyz.com/lanchon
2013/5/12 Renato Marroquín Mogrovejo renatoj.marroq
rules accept one input url at a time
from console (you need to type/paste the url and hit enter).
It shows + if the url is accepted by the current rules. (- for
rejection).
Thanks,
Tejas
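The accept/reject behaviour described here can be emulated in a few lines. This is a sketch under stated assumptions, not Nutch's actual code: it assumes every rule starts with `+` (accept) or `-` (reject), that the first matching rule decides, and that a URL matching no rule is rejected. The function names and the rule set are hypothetical.

```python
import re


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # Assumption: a rule without a leading sign is treated as an error.
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return the URL prefixed with '+' if accepted, '-' if rejected."""
    for accept, pattern in rules:
        if pattern.search(url):  # first matching rule decides
            return ("+" if accept else "-") + url
    return "-" + url  # no rule matched: treated as rejected


# Hypothetical rule set for illustration only.
rules = parse_rules([
    "# accept the paged listing, reject everything else",
    r"+^http://www\.xyz\.com/\?page=[0-9]{1,3}$",
    "-.",
])

print(check("http://www.xyz.com/?page=1", rules))
print(check("http://www.xyz.com/about", rules))
```

Under these assumptions a bare pattern with no leading sign, such as `^[0-9]{1,45}$`, is rejected at parse time rather than silently ignored.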
On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote
Hi Feng,
So this means I could put any type of information for the seed urls but
what about the ones fetched in the next cycles? They won't have any of this
information right?
And where is this information stored? As part of the fetched or the parsed
information?
Thanks!
Renato M.
On May 10,
the firmest grasp of the internals. I'm not sure how useful
I'll be at this moment.
On Tue, Apr 30, 2013 at 6:21 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi Yves,
Apache Gora does not support Apache Hive just yet, but we have it on
our future plans. If you were
Hi Yves,
Apache Gora does not support Apache Hive just yet, but we have it on
our future plans. If you were willing to dive into an adventure with
Gora we would be happy to help you out with that.
There is a Pig-Gora adapter patch on JIRA, maybe you would like to
give it a look? Although there is
Hi all,
I think the problem is the drivers Microsoft Azure requires. Right now
Gora supports HSQLDB but Azure needs some of the following clients [1],
and right now we have no support in any of those clients :(
Maybe you could try playing around with these clients inside Gora and
give it a shot? I
Cool! I will definitely use this to play with Apache Nutch.
Thanks Lewis!
Renato M.
2013/2/28 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi,
I pushed a real simple script which I use as a cron job to bootstrap
Apache Nutch with 1M URLs every day.
For those wanting to crawl, test, use
Hi there,
There are two options, and it will depend on how you would like to
access the data.
If you would like to access this using a data flow (Pig directly) you
can find tons of information e.g. [1], but if you want to use a
JDBC-like access, you can use Gora ;)
Let us know what you feel like
Hi Tejas,
2013/2/16 Tejas Patil tejas.patil...@gmail.com:
Hey Lewis,
I am not knowledgeable about Gora but am curious to know how parsing
performance might be affected if one uses a different storage backend. With HBase it worked fine
for the OP but Cassandra gave this problem. Is the parsing code separate
Hi all,
I have started playing around with Apache Nutch, but I think this
output is strange because I would actually like to fetch the
content.
Is there any configuration or a simple step I might be missing?
baseUrl:null
status: 1 (status_unfetched)
fetchInterval: 2592000
fetchTime:
command/script or
individually via a custom script? We advise against using the
deprecated Crawl class in both distributions.
Best
Lewis
On Sun, Dec 9, 2012 at 12:26 AM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi all,
I have started playing around with Apache Nutch