I don't think we need to bundle the HBase input and output formats in a single PR, so I think we can proceed with the InputFormat only and target the OutputFormat later. However, the fix for Kryo should be in master before the PR is merged. Till is currently working on that and said he expects it to be done by the end of the week.
Cheers, Fabian

2014-11-07 12:49 GMT+01:00 Flavio Pompermaier <[email protected]>:

I also fixed the profile for Cloudera CDH 5.1.3. You can build it with the command:

    mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3

However, it would be good to generate the vendor-specific jar when releasing (e.g. flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating).

Best, Flavio

On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <[email protected]> wrote:

I've just updated the code on my fork (synced with the current master and applied the improvements from the comments on the related PR). I still have to understand how to write results back to an HBase Sink/OutputFormat...

On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <[email protected]> wrote:

Thanks for the detailed answer. So if I run a job from my machine, I'll have to download all the scanned data of a table, right?

Regarding the GenericTableOutputFormat, it is still not clear to me how to proceed.
I saw in the Hadoop compatibility addon that such compatibility can be achieved using the HadoopUtils class, so the open method should become something like:

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        if (Integer.toString(taskNumber + 1).length() > 6) {
            throw new IOException("Task id too large.");
        }
        TaskAttemptID taskAttemptID = TaskAttemptID.forName("attempt__0000_r_"
            + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
            + Integer.toString(taskNumber + 1)
            + "_0");
        this.configuration.set("mapred.task.id", taskAttemptID.toString());
        this.configuration.setInt("mapred.task.partition", taskNumber + 1);
        // for hadoop 2.2
        this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
        this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
        try {
            this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
        try {
            this.writer = outFormat.getRecordWriter(this.context);
        } catch (InterruptedException iex) {
            throw new IOException("Opening the writer was interrupted.", iex);
        }
    }

But I'm not sure how to pass the JobConf to the class, whether to merge config files, where HFileOutputFormat2 writes the data, or how to implement the public void writeRecord(Record record) API. Could I have a short chat off the mailing list with the implementor of this extension?

On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <[email protected]> wrote:

Hi Flavio,

let me try to answer your last question from the user's list (to the best of my HBase knowledge).
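As a side note, the string arithmetic in the open() method above just zero-pads the 1-based task number to six digits inside an artificial Hadoop attempt id. A minimal, self-contained sketch of the same formatting (the class name here is made up for illustration) shows it can be written with a single format specifier:

```java
public class AttemptIdSketch {
    // Builds an artificial attempt id of the form "attempt__0000_r_NNNNNN_0",
    // zero-padding the 1-based task number to six digits -- equivalent to the
    // manual "%Ns" + replace(" ", "0") padding in the open() method above.
    static String attemptId(int taskNumber) {
        int id = taskNumber + 1;
        if (Integer.toString(id).length() > 6) {
            throw new IllegalArgumentException("Task id too large.");
        }
        return String.format("attempt__0000_r_%06d_0", id);
    }

    public static void main(String[] args) {
        System.out.println(attemptId(0));  // attempt__0000_r_000001_0
        System.out.println(attemptId(41)); // attempt__0000_r_000042_0
    }
}
```

This form also avoids the zero-width format string the original would produce for a task number that is already six digits long.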
"I just wanted to know if and how region splitting is handled. Can you explain in detail how Flink and HBase work together? What is not fully clear to me is when computation is done by the region servers and when data starts flowing to a Flink worker (which in my test job is only my PC), and how to read the important logged info to understand whether my job is performing well."

HBase partitions its tables into so-called "regions" of keys and stores the regions distributed across the cluster using HDFS. I think an HBase region can be thought of as an HDFS block. To make reading an HBase table efficient, regions should be read locally, i.e., an InputFormat should primarily read regions that are stored on the same machine the IF is running on. Flink's InputSplits partition the HBase input by regions and add information about the storage location of each region. During execution, input splits are assigned to InputFormats that can do local reads.

Best, Fabian

2014-11-03 11:13 GMT+01:00 Stephan Ewen <[email protected]>:

Hi!

The way of passing parameters through the configuration is very old (the original HBase format dates back to that time). I would simply make the HBase format take those parameters through the constructor.

Greetings, Stephan

On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <[email protected]> wrote:

The problem is that I also removed the GenericTableOutputFormat, because there is an incompatibility between hadoop1 and hadoop2 for the classes TaskAttemptContext and TaskAttemptContextImpl. It would also be nice if the user didn't have to worry about passing the pact.hbase.jtkey and pact.job.id parameters.
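Fabian's locality point can be illustrated with a small self-contained sketch (all names here are hypothetical, not Flink's actual API): each split remembers which hosts store its region, and assignment prefers a split whose region is local to the requesting worker.

```java
import java.util.List;

public class LocalitySketch {
    // Hypothetical model of a locality-aware input split: it records the
    // hosts that store its HBase region.
    static class Split {
        final String region;
        final List<String> hosts;
        Split(String region, List<String> hosts) {
            this.region = region;
            this.hosts = hosts;
        }
    }

    // Prefer a split whose region is stored on the worker's host (local
    // read); otherwise fall back to an arbitrary remote split.
    static Split assign(String workerHost, List<Split> pending) {
        for (Split s : pending) {
            if (s.hosts.contains(workerHost)) {
                return s; // local read: region data is on this machine
            }
        }
        return pending.isEmpty() ? null : pending.get(0); // remote read
    }

    public static void main(String[] args) {
        List<Split> splits = List.of(
            new Split("region-a", List.of("node1", "node2")),
            new Split("region-b", List.of("node3")));
        System.out.println(assign("node3", splits).region); // local match
        System.out.println(assign("node9", splits).region); // remote fallback
    }
}
```

The real scheduler is more involved, but this captures why the InputSplits carry storage locations: without them, every read could be a remote one.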
I think it is probably a good idea to remove hadoop1 compatibility, keep the HBase addon enabled only for hadoop2 (as before), and decide how to manage those two parameters.

On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <[email protected]> wrote:

It is fine to remove it, in my opinion.

On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <[email protected]> wrote:

That is one class I removed because it was using the deprecated GenericDataSink API. I can restore it, but then it would be a good idea to remove those warnings (also because, from what I understood, the Record APIs are going to be removed).

On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <[email protected]> wrote:

I'm not familiar with the HBase connector code, but are you maybe looking for the GenericTableOutputFormat?

2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <[email protected]>:

I was trying to modify the example by setting hbaseDs.output(new HBaseOutputFormat()); but I can't see any HBaseOutputFormat class. Maybe we should use another class?

On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <[email protected]> wrote:

Maybe that's something I could add to the HBase example, and it could be better documented in the wiki.
Since we're talking about the wiki: I was looking at the Java API guide (http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html), and the link to the KMeans example is not working (where it says "For a complete example program, have a look at KMeans Algorithm").

Best, Flavio

On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <[email protected]> wrote:

Ah ok, perfect! That was the reason why I removed it :)

On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <[email protected]> wrote:

You do not really need an HBase data sink. You can call "DataSet.output(new HBaseOutputFormat())".

Stephan

On 02.11.2014 23:05, "Flavio Pompermaier" <[email protected]> wrote:

Just one last thing: I removed the HbaseDataSink because I think it was using the old APIs. Can someone help me update that class?
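Stephan's "DataSet.output(new HBaseOutputFormat())" suggestion relies on the output-format lifecycle: the runtime opens the format once per parallel task, feeds it records, then closes it. A minimal self-contained sketch of that shape (the interface and class names here are made up, not Flink's actual types; a real HBase format would open an HTable in open() and turn each record into a Put in writeRecord()):

```java
import java.util.ArrayList;
import java.util.List;

public class OutputFormatSketch {
    // Hypothetical minimal shape of a Flink-style output format lifecycle.
    interface SimpleOutputFormat<T> {
        void open(int taskNumber, int numTasks); // acquire resources per task
        void writeRecord(T record);              // called once per record
        void close();                            // flush and release resources
    }

    // Toy implementation that just collects records, standing in for a
    // format that would write Puts to an HBase table.
    static class CollectingFormat implements SimpleOutputFormat<String> {
        final List<String> written = new ArrayList<>();
        public void open(int taskNumber, int numTasks) { }
        public void writeRecord(String record) { written.add(record); }
        public void close() { }
    }

    public static void main(String[] args) {
        CollectingFormat format = new CollectingFormat();
        format.open(0, 1);        // the runtime does this, not user code
        format.writeRecord("row1");
        format.writeRecord("row2");
        format.close();
        System.out.println(format.written);
    }
}
```

This is also why no dedicated HBaseDataSink class is needed: any format implementing this lifecycle can be handed to output().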
> >>> > > > > > > >>> > > >>> > > > > > > >>> > On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier < > >>> > > > > > > >>> [email protected]> > >>> > > > > > > >>> > wrote: > >>> > > > > > > >>> > > >>> > > > > > > >>> > > Indeed this time the build has been successful :) > >>> > > > > > > >>> > > > >>> > > > > > > >>> > > On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske < > >>> > > > > > [email protected] > >>> > > > > > > > > >>> > > > > > > >>> > wrote: > >>> > > > > > > >>> > > > >>> > > > > > > >>> > >> You can also setup Travis to build your own Github > >>> > > > > repositories > >>> > > > > > by > >>> > > > > > > >>> > linking > >>> > > > > > > >>> > >> it to your Github account. That way Travis can > >>> build all > >>> > > > your > >>> > > > > > > >>> branches > >>> > > > > > > >>> > >> (and > >>> > > > > > > >>> > >> you can also trigger rebuilds if something fails). > >>> > > > > > > >>> > >> Not sure if we can manually trigger retrigger > >>> builds on > >>> > > the > >>> > > > > > Apache > >>> > > > > > > >>> > >> repository. > >>> > > > > > > >>> > >> > >>> > > > > > > >>> > >> Support for Hadoop 1 and 2 is indeed a very good > >>> > addition > >>> > > > :-) > >>> > > > > > > >>> > >> > >>> > > > > > > >>> > >> For the discusion about the PR itself, I would > need > >>> a > >>> > bit > >>> > > > more > >>> > > > > > > time > >>> > > > > > > >>> to > >>> > > > > > > >>> > >> become more familiar with HBase. I do also not > have > >>> a > >>> > > HBase > >>> > > > > > setup > >>> > > > > > > >>> > >> available > >>> > > > > > > >>> > >> here. > >>> > > > > > > >>> > >> Maybe somebody else of the community who was > >>> involved > >>> > > with a > >>> > > > > > > >>> previous > >>> > > > > > > >>> > >> version of the HBase connector could comment on > your > >>> > > > question. 
Best, Fabian

2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <[email protected]>:

As suggested by Fabian, I moved the discussion to this mailing list.

I think what is still to be discussed is how to retrigger the build on Travis (I don't have an account) and whether the PR can be integrated.

Maybe what I can do is move the HBase example into the test package (right now I left it in the main folder) so it will force Travis to rebuild. I'll do it within a couple of hours.

Another thing I forgot to say is that the hbase extension is now compatible with both hadoop 1 and 2.

Best, Flavio
