Hi Jinal,

Would it be possible to dump the dot file of the job plan in both situations (i.e. where it runs as a single job and where it runs as multiple jobs) and post them somewhere?
You can get the job's dot file by calling pipeline.getConfiguration().get(PlanningParameters.PIPELINE_PLAN_DOTFILE), or it's also accessible via a PipelineExecution if you're doing asynchronous execution.

- Gabriel

On Fri, May 2, 2014 at 10:32 PM, Jinal Shah <[email protected]> wrote:
> Gabriel,
>
> This is what is happening here:
>
>     PTable<ImmutableBytesWritable, Result> rawResults =
>         pipeline.read(new HBaseSourceTarget(table, scans[0]));
>
> and then a parallelDo to convert rawResults into a PTable<Key, Value>;
> that's it.
>
> For the other question: we do an implicit sort by date during the
> splitting, in the map function itself, if that's what you are looking
> for; other than that, no. We are not making any Sort library calls for
> this sort. This is my understanding, so if I'm wrong let me know.
>
> Thanks,
> Jinal
>
> On Fri, May 2, 2014 at 3:10 PM, Gabriel Reid <[email protected]> wrote:
>> Hi Jinal,
>>
>> Could you give a bit more info on what exactly is going on in the call
>> to convertingDataIntoPTable? Is a Sort library call used anywhere in
>> your pipeline?
>>
>> - Gabriel
>>
>> On Fri, May 2, 2014 at 8:48 PM, Jinal Shah <[email protected]> wrote:
>> > Hi,
>> >
>> > I'm working on something and there is some strange behavior happening,
>> > and I'm not sure whether it should happen that way or not. Here is what
>> > I'm doing:
>> >
>> > There are two HBase tables I'm scanning, and both go through the same
>> > process.
>> > One table has less data and the other has a comparatively large
>> > amount of data. Here is the process:
>> >
>> > *--- Start Processing*
>> >
>> > Scan[] data = comingFromHbase();
>> >
>> > // Now I'm converting this into a PTable
>> > PTable<Key, Value> table = convertingDataIntoPTable(data);
>> >
>> > // Expecting this to cache the data so it isn't read again from HBase
>> > // for further processing
>> > table.cache();
>> >
>> > // Now I'm splitting the data into 2 parts
>> > PCollection<Pair<PTable<Key, Value>, PTable<Key, Value>>> split =
>> >     someFunctionDoingTheSplitOnTableData(table);
>> >
>> > // There are further operations on the split
>> >
>> > *--- END Processing*
>> >
>> > For the above process, the expected behavior is that if the data has to
>> > be read again, it will be read from the cache as opposed to reading from
>> > HBase directly when processes run in parallel.
>> >
>> > This is what happens for the table with less data, and the whole process
>> > works as expected. But for the large table, two different jobs are
>> > started in parallel, each reading from HBase separately and doing the
>> > processing separately.
>> >
>> > Any idea why this is happening?
>> >
>> > Thanks,
>> > Jinal
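[For readers following the thread: Gabriel's suggestion above can be sketched roughly as below. This is a minimal, hedged sketch, not code from the thread: it assumes Apache Crunch (crunch-core, crunch-hbase) and the HBase client jars on the classpath and a reachable cluster to actually run it, and the class name `MyDriver`, the table name "myTable", and the placement of the cache() call are hypothetical stand-ins for Jinal's real pipeline.]

```java
// Sketch: read from HBase with Crunch, cache the PTable so downstream
// branches reuse it instead of re-scanning HBase, then dump the planner's
// dot file for inspection, as Gabriel suggests.
// Assumptions: Crunch + HBase client jars are available; MyDriver and
// "myTable" are hypothetical names, not from the original thread.
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.impl.mr.plan.PlanningParameters;
import org.apache.crunch.io.hbase.HBaseSourceTarget;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(MyDriver.class, new Configuration());

    // Read the raw rows from HBase (table name and scan are placeholders).
    PTable<ImmutableBytesWritable, Result> rawResults =
        pipeline.read(new HBaseSourceTarget("myTable", new Scan()));

    // Hint to Crunch that this PTable should be cached so that multiple
    // downstream consumers read the cached copy rather than the source.
    rawResults.cache();

    // ... the parallelDo conversion and the split would go here ...

    pipeline.run();

    // After planning, the dot representation of the job plan is available
    // in the pipeline's Configuration under PIPELINE_PLAN_DOTFILE.
    String dot = pipeline.getConfiguration()
        .get(PlanningParameters.PIPELINE_PLAN_DOTFILE);
    System.out.println(dot);

    pipeline.done();
  }
}
```

The resulting dot text can be rendered with Graphviz to see whether the planner produced one job or several, which is what Gabriel asks to compare between the two tables.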
