[jira] [Created] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2016-12-30 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-15527:
--

 Summary: Memory usage is unbound in SortByShuffler for Spark
 Key: HIVE-15527
 URL: https://issues.apache.org/jira/browse/HIVE-15527
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Affects Versions: 1.1.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang


In SortByShuffler.java, an ArrayList is used to back the iterator for values 
that have the same key in shuffled result produced by spark transformation 
sortByKey. It's possible that memory can be exhausted because of a large key 
group.

{code}
@Override
public Tuple2 next() {
  // TODO: implement this by accumulating rows with the same key 
into a list.
  // Note that this list needs to improved to prevent excessive 
memory usage, but this
  // can be done in later phase.
  while (it.hasNext()) {
Tuple2 pair = it.next();
if (curKey != null && !curKey.equals(pair._1())) {
  HiveKey key = curKey;
  List values = curValues;
  curKey = pair._1();
  curValues = new ArrayList();
  curValues.add(pair._2());
  return new Tuple2(key, 
values);
}
curKey = pair._1();
curValues.add(pair._2());
  }
  if (curKey == null) {
throw new NoSuchElementException();
  }
  // if we get here, this should be the last element we have
  HiveKey key = curKey;
  curKey = null;
  return new Tuple2(key, 
curValues);
}
{code}

Since the output from sortByKey is already sorted on key, it's possible to 
backup the value iterable using the input iterator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Bitmap index implementation in Hive

2016-12-30 Thread poornima kumarasinghe
Hi all,

I am researching on the effect of bitmap indexes on queries and I happened
to come across bitmap indexes on Hive. But I have a few areas that I need
help in understanding how the bitmaps are implemented and used in hive.

1) The bitmap column of the index table is always the same value
([1,2,4,8589934592,1,0]). How can the bitmap of two different values be the
same?

2) The bitmap index used query and the one that didn't use bitmap indexes
gave the following results for their explain statement. The blue colored
sections are the same steps. According to this explain statements, the
bitmap index usage part seems redundant. Can you please help me to clear
out these issues?

Regards,
Samadhi Poornima.

hive> EXPLAIN SELECT count(*) FROM truck_mileage_stage WHERE truckid =
"A30" AND driverid="A30";
OK
STAGE DEPENDENCIES:
  Stage-7 is a root stage
  Stage-4 depends on stages: Stage-7
  Stage-2 depends on stages: Stage-4
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-7
Map Reduce Local Work
  Alias -> Map Local Tables:
$hdt$_0:default__truck_mileage_stage_bittruck_index__
  Fetch Operator
limit: -1
  Alias -> Map Local Operator Tree:
$hdt$_0:default__truck_mileage_stage_bittruck_index__
  TableScan
alias: default__truck_mileage_stage_bittruck_index__
filterExpr: ((truckid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
Filter Operator
  predicate: ((truckid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
  Select Operator
expressions: _bucketname (type: string), _offset (type:
bigint), _bitmaps (type: array)
outputColumnNames: _col0, _col1, _col2
HashTable Sink Operator
  keys:
0 _col0 (type: string), _col1 (type: bigint)
1 _col0 (type: string), _col1 (type: bigint)

  Stage: Stage-4
Map Reduce
  Map Operator Tree:
  TableScan
alias: default__truck_mileage_stage_bitdriver_index__
filterExpr: ((driverid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
Filter Operator
  predicate: ((driverid = 'A30') and _bucketname is not null
and _offset is not null) (type: boolean)
  Select Operator
expressions: _bucketname (type: string), _offset (type:
bigint), _bitmaps (type: array)
outputColumnNames: _col0, _col1, _col2
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 _col0 (type: string), _col1 (type: bigint)
1 _col0 (type: string), _col1 (type: bigint)
  outputColumnNames: _col0, _col2, _col4, _col5
  Filter Operator
predicate: (not
EWAH_BITMAP_EMPTY(EWAH_BITMAP_AND(_col2,_col5))) (type: boolean)
Select Operator
  expressions: _col0 (type: string), _col4 (type:
bigint)
  outputColumnNames: _col0, _col4
  Group By Operator
aggregations: collect_set(_col4)
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Reduce Output Operator
  key expressions: _col0 (type: string)
  sort order: +
  Map-reduce partition columns: _col0 (type: string)
  value expressions: _col1 (type: array)
  Local Work:
Map Reduce Local Work
  Reduce Operator Tree:
Group By Operator
  aggregations: collect_set(VALUE._col0)
  keys: KEY._col0 (type: string)
  mode: mergepartial
  outputColumnNames: _col0, _col1
  File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
Move Operator
  files:
  hdfs directory: true
  destination:
file:/tmp/samadhik/392bef0d-0e26-448f-bdce-7621965b1ad0/hive_2016-12-30_09-56-12_455_2464238236026798959-1/-mr-10004

  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: truck_mileage_stage
filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
Statistics: Num rows: 1148 Data size: 229744 Basic stats:
COMPLETE Column stats: NONE
Filter Operator
  predicate: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)