Does the HBase jar in the lib folder contain a config that could be used instead of the config in the job jar file? Or is there simply no config available at all when the configure method is called?
--
Fabian Hueske
Phone: +49 170 5549438
Email: [email protected]
Web: http://www.user.tu-berlin.de/fabian.hueske

From: Flavio Pompermaier
Sent: Thursday, 13. November, 2014 21:43
To: [email protected]

The hbase jar is in the lib directory on each node, while the config files are within the jar file I submit from the web client.

On Nov 13, 2014 9:37 PM, <[email protected]> wrote:

Have you added the hbase.jar file with your HBase config to the ./lib folders of your Flink setup (JobManager, TaskManager), or is it bundled with your job.jar file?

From: Flavio Pompermaier
Sent: Thursday, 13. November, 2014 18:36
To: [email protected]

Any help with this? :(

On Thu, Nov 13, 2014 at 2:06 PM, Flavio Pompermaier <[email protected]> wrote:

We definitely discovered that instantiating HTable and Scan in the configure() method of TableInputFormat causes problems in a distributed environment!
If you look at my implementation at
https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
you can see that Scan and HTable were made transient and are recreated within configure(), but this causes HBaseConfiguration.create() to fail when searching for classpath files. Could you help us understand why?

On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <[email protected]> wrote:

Usually, when I run a mapreduce job on either Spark or Hadoop, I just put the *-site.xml files into the war I submit to the cluster and that's it. I think the problem appeared when I made the HTable a private transient field and the table instantiation was moved into the configure() method.
Could it be a valid reason?
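The transient-field behaviour described above can be reproduced without HBase or Flink at all. In the sketch below, every name is invented for illustration (TableLike stands in for HTable, MyInputFormat for the input format): after serializing the format to the worker and deserializing it, the transient field is null, so configure() must rebuild it there, and anything it looks up on the classpath at that point, such as what HBaseConfiguration.create() does with hbase-site.xml, resolves against the worker's classpath, not the client's.

```java
import java.io.*;

public class TransientFieldDemo {
    // Stand-in for HTable: not serializable, must be created where it is used.
    static class TableLike {
        final String name;
        TableLike(String name) { this.name = name; }
    }

    static class MyInputFormat implements Serializable {
        private final String tableName;     // shipped with the serialized format
        private transient TableLike table;  // NOT shipped; null after deserialization

        MyInputFormat(String tableName) { this.tableName = tableName; }

        // Called on the worker after deserialization, like Flink's configure().
        void configure() {
            if (table == null) {
                // Here HBaseConfiguration.create() would read hbase-site.xml
                // from the *worker's* classpath.
                table = new TableLike(tableName);
            }
        }

        TableLike getTable() { return table; }
    }

    // Simulates shipping the format to a worker: serialize, then deserialize.
    static MyInputFormat roundTrip(MyInputFormat f) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(f);
        oos.flush();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (MyInputFormat) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        MyInputFormat shipped = roundTrip(new MyInputFormat("mytable"));
        System.out.println("after deserialization: table = " + shipped.getTable());
        shipped.configure();
        System.out.println("after configure():     table = " + shipped.getTable().name);
    }
}
```

This would also be consistent with the job working on the local executor (client and worker share one classpath) while failing on the cluster.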
We still have to do some deeper debugging, but I'm trying to figure out where to investigate.

On Nov 12, 2014 8:03 PM, "Robert Metzger" <[email protected]> wrote:

Hi,
maybe it's an issue with the classpath? As far as I know, Hadoop reads the configuration files from the classpath. Maybe the hbase-site.xml file is not accessible through the classpath when running on the cluster?

On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <[email protected]> wrote:

Today we tried to execute a job on the cluster instead of on the local executor, and we faced the problem that the hbase-site.xml was basically ignored. Is there a reason why the TableInputFormat works correctly in the local environment while it doesn't on a cluster?

On Nov 10, 2014 10:56 AM, "Fabian Hueske" <[email protected]> wrote:

I don't think we need to bundle the HBase input and output format in a single PR. So, I think we can proceed with the IF only and target the OF later. However, the fix for Kryo should be in the master before merging the PR. Till is currently working on that and said he expects this to be done by the end of the week.

Cheers, Fabian

2014-11-07 12:49 GMT+01:00 Flavio Pompermaier <[email protected]>:

I also fixed the profile for Cloudera CDH 5.1.3. You can build it with the command:

mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3

However, it would be good to generate the specific jar when releasing (e.g.
flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating).

Best,
Flavio

On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <[email protected]> wrote:

I've just updated the code on my fork (synced with the current master and applied improvements coming from comments on the related PR). I still have to understand how to write results back to an HBase Sink/OutputFormat...

On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <[email protected]> wrote:

Thanks for the detailed answer. So if I run a job from my machine, I'll have to download all the scanned data in a table, right?

Still regarding the GenericTableOutputFormat, it is not clear to me how to proceed.
I saw in the hadoop compatibility addon that it is possible to have such compatibility using the HadoopUtils class, so the open method should become something like:

@Override
public void open(int taskNumber, int numTasks) throws IOException {
    if (Integer.toString(taskNumber + 1).length() > 6) {
        throw new IOException("Task id too large.");
    }
    TaskAttemptID taskAttemptID = TaskAttemptID.forName("attempt__0000_r_"
        + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
        + Integer.toString(taskNumber + 1)
        + "_0");
    this.configuration.set("mapred.task.id", taskAttemptID.toString());
    this.configuration.setInt("mapred.task.partition", taskNumber + 1);
    // for hadoop 2.2
    this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
    this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
    try {
        this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
    try {
        this.writer = outFormat.getRecordWriter(this.context);
    } catch (InterruptedException iex) {
        throw new IOException("Opening the writer was interrupted.", iex);
    }
}

But I'm not sure about how to pass the JobConf to the class, whether to merge config files,
where HFileOutputFormat2 writes the data, and how to implement the public void writeRecord(Record record) API.
Could I have a little chat off the mailing list with the implementor of this extension?

On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <[email protected]> wrote:

Hi Flavio,

let me try to answer your last question from the user's list (to the best of my HBase knowledge):

"I just wanted to know if and how region splitting is handled. Can you explain to me in detail how Flink and HBase work together? What is not fully clear to me is when computation is done by the region servers and when data starts to flow to a Flink worker (which in my test job is only my PC), and how to better understand the important logged info to see whether my job is performing well."

HBase partitions its tables into so-called "regions" of keys and stores the regions distributed across the cluster using HDFS. I think an HBase region can be thought of as an HDFS block. To make reading an HBase table efficient, region reads should be done locally, i.e., an InputFormat should primarily read regions that are stored on the same machine the IF is running on. Flink's InputSplits partition the HBase input by regions and add information about the storage location of the region.
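As a side note, the task-attempt-ID padding in the open() method quoted earlier in the thread is plain string work and can be checked in isolation. The helper below is an invented equivalent (it uses %06d rather than the exact padding expression from the snippet, which produces the same string for task numbers up to 999999):

```java
public class AttemptIdFormat {
    // Builds the same attempt id string as the open() method above:
    // taskNumber + 1, left-padded with zeros to 6 digits.
    static String attemptId(int taskNumber) {
        if (Integer.toString(taskNumber + 1).length() > 6) {
            throw new IllegalArgumentException("Task id too large.");
        }
        return String.format("attempt__0000_r_%06d_0", taskNumber + 1);
    }

    public static void main(String[] args) {
        System.out.println(attemptId(0));   // attempt__0000_r_000001_0
        System.out.println(attemptId(41));  // attempt__0000_r_000042_0
    }
}
```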
During execution, input splits are assigned to InputFormats that can do local reads.

Best, Fabian

2014-11-03 11:13 GMT+01:00 Stephan Ewen <[email protected]>:

Hi!

The way of passing parameters through the configuration is very old (the original HBase format dates back to that time). I would simply make the HBase format take those parameters through the constructor.

Greetings,
Stephan

On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <[email protected]> wrote:

The problem is that I also removed the GenericTableOutputFormat because there is an incompatibility between hadoop1 and hadoop2 for the classes TaskAttemptContext and TaskAttemptContextImpl.
It would also be nice if the user didn't have to worry about passing the pact.hbase.jtkey and pact.job.id parameters.
I think it is probably a good idea to remove hadoop1 compatibility, keep the HBase addon enabled only for hadoop2 (as before), and decide how to manage those two parameters.

On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <[email protected]> wrote:

It is fine to remove it, in my opinion.
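Fabian's description of locality-aware splits above could be sketched like this (a toy illustration with invented names; this is not Flink's actual split-assignment code): each split carries the hosts that store its region, and a worker prefers splits local to itself before falling back to remote reads.

```java
import java.util.*;

public class LocalSplitAssignment {
    // A split knows which hosts store its HBase region (via HDFS block locations).
    record Split(String region, List<String> hosts) {}

    // Prefer a split stored on the requesting worker; fall back to any remaining one.
    static Split nextSplit(List<Split> unassigned, String workerHost) {
        for (int i = 0; i < unassigned.size(); i++) {
            if (unassigned.get(i).hosts().contains(workerHost)) {
                return unassigned.remove(i); // local read: region bytes stay on the node
            }
        }
        return unassigned.isEmpty() ? null : unassigned.remove(0); // remote read
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>(List.of(
                new Split("region-A", List.of("node1", "node2")),
                new Split("region-B", List.of("node3"))));
        System.out.println(nextSplit(splits, "node3").region()); // region-B (local)
        System.out.println(nextSplit(splits, "node3").region()); // region-A (remote)
    }
}
```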
On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <[email protected]> wrote:

That is one of the classes I removed because it was using the deprecated GenericDataSink API. I can restore them, but then it would be a good idea to remove those warnings (also because, from what I understood, the Record APIs are going to be removed).

On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <[email protected]> wrote:

I'm not familiar with the HBase connector code, but are you maybe looking for the GenericTableOutputFormat?

2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <[email protected]>:

I was trying to modify the example setting hbaseDs.output(new HBaseOutputFormat()); but I can't see any HBaseOutputFormat class. Maybe we should use another class?
On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <[email protected]> wrote:

Maybe that's something I could add to the HBase example, and that could be better documented in the wiki.

Since we're talking about the wiki: I was looking at the Java API (
http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html
) and the link to the KMeans example is not working (where it says "For a complete example program, have a look at KMeans Algorithm").

Best,
Flavio

On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <[email protected]> wrote:

Ah ok, perfect!
That was the reason why I removed it :)

On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <[email protected]> wrote:

You do not really need an HBase data sink. You can call "DataSet.output(new HBaseOutputFormat())".

Stephan

On 02.11.2014 23:05, "Flavio Pompermaier" <[email protected]> wrote:

Just one last thing: I removed the HbaseDataSink because I think it was using the old APIs. Can someone help me in updating that class?
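Stephan's point, that an output format driven by DataSet.output() makes a dedicated sink class unnecessary, can be illustrated with a toy version of the open/writeRecord/close life cycle (all names here are invented for the sketch; Flink's real interfaces differ):

```java
import java.util.*;

public class OutputFormatAsSink {
    // Toy version of the life cycle an output format follows.
    interface OutFormat<T> {
        void open(int taskNumber, int numTasks);
        void writeRecord(T record);
        void close();
    }

    // A "sink" is then just a loop driving any format, so no HBase-specific
    // sink class is needed on top of the format itself.
    static <T> void emitAll(Iterable<T> data, OutFormat<T> format) {
        format.open(0, 1);
        for (T r : data) {
            format.writeRecord(r);
        }
        format.close();
    }

    // Test double that collects records instead of writing to HBase.
    static class CollectingFormat implements OutFormat<String> {
        final List<String> written = new ArrayList<>();
        public void open(int taskNumber, int numTasks) { }
        public void writeRecord(String record) { written.add(record); }
        public void close() { }
    }

    public static void main(String[] args) {
        CollectingFormat f = new CollectingFormat();
        emitAll(List.of("row1", "row2"), f);
        System.out.println(f.written); // [row1, row2]
    }
}
```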
On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <[email protected]> wrote:

Indeed, this time the build has been successful :)

On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <[email protected]> wrote:

You can also set up Travis to build your own GitHub repositories by linking it to your GitHub account. That way, Travis can build all your branches (and you can also trigger rebuilds if something fails). I'm not sure if we can manually retrigger builds on the Apache repository.

Support for Hadoop 1 and 2 is indeed a very good addition :-)

For the discussion about the PR itself, I would need a bit more time to become more familiar with HBase.
I also do not have an HBase setup available here. Maybe somebody else from the community who was involved with a previous version of the HBase connector could comment on your question.

Best, Fabian

2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <[email protected]>:

As suggested by Fabian, I moved the discussion to this mailing list.

I think that what is still to be discussed is how to retrigger the build on Travis (I don't have an account) and whether the PR can be integrated.
Maybe what I can do is move the HBase example into the test package (right now I left it in the main folder) so it will force Travis to rebuild. I'll do it within a couple of hours.

Another thing I forgot to say is that the hbase extension is now compatible with both hadoop 1 and 2.

Best,
Flavio
