Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
Storage handler bulk load: ideally it would be as simple as

SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;

but for now you have to do some work and issue multiple Hive commands
(sketched below):

1. Sample source data for range partitioning.
2. Save the sampling results to a file.
3. Run a CLUSTER BY query using HiveHFileOutputFormat and
   TotalOrderPartitioner (sorts the data, producing a large number of
   region files).
4. Import the HFiles into HBase.
5. HBase can merge files if necessary.
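
A condensed sketch of steps 3 and 4, loosely following the Hive
HBaseBulkLoad wiki (table names, paths, columns, and the reducer count are
illustrative, not from the original mail):

-- Step 3: sort into HFiles for column family "cf" via HiveHFileOutputFormat.
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hb_range_key_list;  -- split keys from steps 1-2
SET mapred.reduce.tasks=12;  -- roughly one reducer per target region

CREATE TABLE hbsort(key STRING, value STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');

INSERT OVERWRITE TABLE hbsort
SELECT key, value FROM source_table CLUSTER BY key;

-- Step 4: import the generated HFiles into the HBase table "users":
--   hadoop jar /usr/lib/hbase/hbase.jar completebulkload /tmp/hbsort users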

On Sat, Jan 28, 2017 at 11:32 AM, Chetan Khatri  wrote:

> @Ted, I don't think so.
>
> On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu  wrote:
>
>> Does the storage handler provide bulk load capability?
>>
>> Cheers
>>
>> On Jan 25, 2017, at 3:39 AM, Amrit Jangid  wrote:
>>
>> Hi Chetan,
>>
>> If you just need HBase data in Hive, you can use a Hive EXTERNAL TABLE
>> with
>>
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'.
>>
>> Try this and see if it solves your problem:
>>
>>
>> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
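>>
>> For reference, a minimal sketch of such a mapping in the style of that
>> wiki page (the table, column family, and column names are illustrative):
>>
>> CREATE EXTERNAL TABLE hbase_users(key string, name string, age int)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:age')
>> TBLPROPERTIES ('hbase.table.name' = 'users');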
>>
>>
>> Regards
>>
>> Amrit
>>
>>
>>
>> On Wed, Jan 25, 2017 at 5:02 PM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Community Folks,
>>>
>>> Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking for
>>> bulk load from HBase to Hive.
>>>
>>> I have seen a couple of good examples in the HBase GitHub repo:
>>> https://github.com/apache/hbase/tree/master/hbase-spark
>>>
>>> If I would like to use HBaseContext with HBase 1.2.4, how can it be
>>> done? Or which version of HBase is more stable with HBaseContext?
>>>
>>> Thanks.
>>>
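
Regarding the HBaseContext question: a minimal Java sketch of the
JavaHBaseContext API from the hbase-spark module linked above. Note the
module is not shipped with HBase 1.2.4 releases, so this assumes you build
it yourself (or use a distribution that backports it); the class name,
table, and column names here are illustrative:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseContextSketch {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setAppName("HBaseContextSketch").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();

        // JavaHBaseContext ties the Spark context to the HBase configuration.
        JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);

        // Bulk-put a few rows into the (pre-existing) table "users".
        JavaRDD<String> keys = jsc.parallelize(Arrays.asList("row1", "row2"));
        hbaseContext.bulkPut(keys, TableName.valueOf("users"),
                key -> new Put(Bytes.toBytes(key))
                        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                                Bytes.toBytes("v")));

        jsc.stop();
    }
}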


CFP for Spark Summit San Francisco closes on Feb. 6

2017-01-27 Thread Scott walent
In June, the 10th Spark Summit will take place in San Francisco at Moscone
West. We have expanded our CFP to include more topics and deep-dive
technical sessions.

Take center stage in front of your fellow Spark enthusiasts. Submit your
presentation and join us for the big ten. The CFP closes on February 6th!

Submit your abstracts at https://spark-summit.org/2017


Re: Issue creating row with java.util.Map type

2017-01-27 Thread Richard Xin
Try:

Row newRow = RowFactory.create(row.getString(0), row.getString(1), row.getMap(2));
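
To expand on that: in Spark 2.0 the encoder used by Dataset.map expects a
MapType column as a scala.collection.Map, which row.getMap(2) returns; a
java.util.HashMap is only accepted on the createDataFrame path, which
converts Java collections (hence the "not a valid external type" error
below). A minimal sketch of the corrected call, using the rowEncoder from
the quoted code:

out.map((MapFunction<Row, Row>) row ->
        RowFactory.create(row.getString(0), row.getString(1), row.getMap(2)),
    rowEncoder).show();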

On Friday, January 27, 2017 10:52 AM, Ankur Srivastava  wrote:

+ DEV Mailing List

On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava  wrote:

Hi,

I am trying to map a Dataset with rows which have a map attribute. When I
try to create a Row with the map attribute I get cast errors. I am able to
reproduce the issue with the sample code below. The surprising thing is
that with the same schema I am able to create a dataset from the list of
rows.

I am on Spark 2.0 and Scala 2.11.

public static void main(String[] args) {
    StructType schema = new StructType().add("src", DataTypes.StringType)
            .add("dst", DataTypes.StringType)
            .add("freq", DataTypes.createMapType(DataTypes.StringType,
                    DataTypes.IntegerType));
    List<Row> inputData = new ArrayList<>();
    inputData.add(RowFactory.create("1", "2", new HashMap<>()));
    SparkSession sparkSession = SparkSession
            .builder()
            .appName("IPCountFilterTest")
            .master("local")
            .getOrCreate();

    Dataset<Row> out = sparkSession.createDataFrame(inputData, schema);
    out.show();

    Encoder<Row> rowEncoder = RowEncoder.apply(schema);
    out.map((MapFunction<Row, Row>) row -> {
        Row newRow = RowFactory.create(row.getString(0), row.getString(1),
                new HashMap<String, Integer>());
        // Row newRow = RowFactory.create(row.getString(0),
        //         row.getString(1), row.getJavaMap(2));
        return newRow;
    }, rowEncoder).show();
}
Below is the error:

17/01/26 17:05:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: java.util.HashMap is not a valid external type for schema of map<string,int>
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

17/01/26 17:05:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: java.util.HashMap is not a valid external type for schema of map<string,int>
    (followed by the same stack trace as above)

Thanks
Ankur