Hi Jinal,

Could you give a bit more info on what exactly is going on in the call to convertingDataIntoPTable? Is a Sort library call used anywhere in your pipeline?
- Gabriel

On Fri, May 2, 2014 at 8:48 PM, Jinal Shah <[email protected]> wrote:

> Hi,
>
> I'm working on something and there is some strange behavior that I'm not
> sure should be happening. Here is what I'm doing.
>
> There are two HBase tables I'm scanning, and both go through the same
> process. One table has less data and the other has a comparatively large
> amount of data. Here is the process:
>
> *--- Start Processing*
>
> Scan[] data = comingFromHbase();
>
> // Now I'm converting this into a PTable
> PTable<Key, Value> table = convertingDataIntoPTable(data);
>
> table.cache(); // Expecting this to cache, so further processing does
>                // not read from HBase again
>
> // Now I'm splitting the data into 2 parts
> PCollection<Pair<PTable<Key, Value>, PTable<Key, Value>>> split =
>     someFunctionDoingTheSplitOnTableData(table);
>
> // There are further operations on the split
>
> *--- END Processing*
>
> Now, the expected behavior for the above process is that if the data has
> to be read again, it is read from the cache rather than directly from
> HBase when processes run in parallel.
>
> That is the behavior I see for the table with less data, and the whole
> process works as expected. For the large table, however, two different
> jobs are started in parallel, each reading from HBase separately and
> doing its processing independently.
>
> Any idea why this is happening?
>
> Thanks
> Jinal
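For readers following along: the behavior Jinal expects from table.cache() can be illustrated with a minimal plain-Java analogy (no Crunch dependency; all names here are made up for illustration, this is not the Crunch API). A counting "source" stands in for the HBase scan, and two downstream consumers stand in for the two halves of the split. Without materializing the intermediate result, each consumer triggers its own scan; with it materialized once, the source is read only once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: simulates how caching an intermediate result changes
// how many times an external source (e.g. an HBase table) is scanned when
// two independent consumers read from it.
public class CacheDemo {
    // Returns how many times the simulated source was scanned.
    static int countReads(boolean cached) {
        final int[] scans = {0};
        Supplier<List<String>> source = () -> {
            scans[0]++;                       // one "scan" per invocation
            return List.of("a", "b", "c");
        };

        Supplier<List<String>> pipeline;
        if (cached) {
            List<String> materialized = source.get();  // read once, reuse
            pipeline = () -> materialized;
        } else {
            pipeline = source;                 // every consumer re-scans
        }

        // Two independent downstream consumers (the "split" in the pipeline).
        List<String> left = new ArrayList<>(pipeline.get());
        List<String> right = new ArrayList<>(pipeline.get());
        left.addAll(right);                    // consume both results
        return scans[0];
    }

    public static void main(String[] args) {
        System.out.println("uncached scans: " + countReads(false)); // 2
        System.out.println("cached scans:   " + countReads(true));  // 1
    }
}
```

The large-table case Jinal describes behaves like the uncached branch: two parallel jobs each perform their own read of the source instead of sharing one materialized copy.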
