Hi, I'm seeing some strange behavior and I'm not sure whether it is expected. Here is what I'm doing:
I'm scanning 2 HBase tables, and both go through the same process. One table has little data and the other has a comparatively large amount. Here is the process:

*--- Start Processing*
Scan[] data = comingFromHbase();
// Now I'm converting this into a PTable
PTable<Key, Value> table = convertingDataIntoPTable(data);
table.cache(); // Expecting this to cache the data so it is not read again from HBase for further processing
// Now I'm splitting the data into 2 parts
PCollection<Pair<PTable<Key,Value>,PTable<Key,Value>>> split = someFunctionDoingTheSplitOnTableData(table);
// There are further operations on the split
*--- END Processing*

For the above process, the expected behavior is that if the data has to be read again, it is read from the cache as opposed to directly from HBase when the processes run in parallel. This is what I see for the table with less data, and the whole process works as expected. For the large table, however, it starts two different jobs in parallel, and each job reads from HBase separately and does its processing separately.

Any idea why this is happening?

Thanks,
Jinal
