from:"\"Benjamin Kim\""

Re: Spark on Kudu

2016-09-20 Thread Benjamin Kim

I see that the API has changed a bit so my old code doesn’t work anymore. Can 
someone direct me to some code samples?

Thanks,
Ben

> On Sep 20, 2016, at 1:44 PM, Todd Lipcon  wrote:
> 
> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Now that Kudu 1.0.0 is officially out and ready for production use, where do 
> we find the spark connector jar for this release?
> 
> 
> It's available in the official ASF maven repository:  
> https://repository.apache.org/#nexus-search;quick~kudu-spark 
> <https://repository.apache.org/#nexus-search;quick~kudu-spark>
> 
> 
>   org.apache.kudu
>   kudu-spark_2.10
>   1.0.0
> 
> 
> 
> -Todd
>  
> 
> 
>> On Jun 17, 2016, at 11:08 AM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> Hi Ben,
>> 
>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
>> not think we support that at this point.  I haven't looked deeply into it, 
>> but we may hit issues specifying Kudu-specific options (partitioning, column 
>> encoding, etc.).  Probably issues that can be worked through eventually, 
>> though.  If you are interested in contributing to Kudu, this is an area that 
>> could obviously use improvement!  Most or all of our Spark features have 
>> been completely community driven to date.
>>  
>> I am assuming that more Spark support along with semantic changes below will 
>> be incorporated into Kudu 0.9.1.
>> 
>> As a rule we do not release new features in patch releases, but the good 
>> news is that we are releasing regularly, and our next scheduled release is 
>> for the August timeframe (see JD's roadmap 
>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>>  email about what we are aiming to include).  Also, Cloudera does publish 
>> snapshot versions of the Spark connector here 
>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so 
>> the jars are available if you don't mind using snapshots.
>>  
>> Anyone know of a better way to make unique primary keys other than using 
>> UUID to make every row unique if there is no unique column (or combination 
>> thereof) to use.
>> 
>> Not that I know of.  In general it's pretty rare to have a dataset without a 
>> natural primary key (even if it's just all of the columns), but in those 
>> cases UUID is a good solution.
>>  
>> This is what I am using. I know auto incrementing is coming down the line 
>> (don’t know when), but is there a way to simulate this in Kudu using Spark 
>> out of curiosity?
>> 
>> To my knowledge there is no plan to have auto increment in Kudu.  
>> Distributed, consistent, auto incrementing counters is a difficult problem, 
>> and I don't think there are any known solutions that would be fast enough 
>> for Kudu (happy to be proven wrong, though!).
>> 
>> - Dan
>>  
>> 
>> Thanks,
>> Ben
>> 
>>> On Jun 14, 2016, at 6:08 PM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> I'm not sure exactly what the semantics will be, but at least one of them 
>>> will be upsert.  These modes come from spark, and they were really designed 
>>> for file-backed storage and not table storage.  We may want to do append = 
>>> upsert, and overwrite = truncate + insert.  I think that may match the 
>>> normal spark semantics more closely.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Dan,
>>> 
>>> Thanks for the information. That would mean both “append” and “overwrite” 
>>> modes would be combined or not needed in the future.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert >>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> Right now append uses an update Kudu operation, which requires the row 
>>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>>> better, since upsert is the way to go for most spark workloads.
>>>> 
>>>> - Dan
>>>> 
>>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim >>>

Re: Spark on Kudu

2016-09-20 Thread Benjamin Kim

Now that Kudu 1.0.0 is officially out and ready for production use, where do we 
find the spark connector jar for this release?

Thanks,
Ben

> On Jun 17, 2016, at 11:08 AM, Dan Burkert  wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news 
> is that we are releasing regularly, and our next scheduled release is for the 
> August timeframe (see JD's roadmap 
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>  email about what we are aiming to include).  Also, Cloudera does publish 
> snapshot versions of the Spark connector here 
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
> jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase because this is what our use case 
>>> is.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> 
>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Now, I’m getting this err

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim

Todd,

Thanks. I’ll look into those.

Cheers,
Ben


> On Sep 20, 2016, at 12:11 AM, Todd Lipcon  wrote:
> 
> The Apache Kudu team is happy to announce the release of Kudu 1.0.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds several new features, including:
> 
> - Removal of multiversion concurrency control (MVCC) history is now 
> supported. This allows Kudu to reclaim disk space, where previously Kudu 
> would keep a full history of all changes made to a given table since the 
> beginning of time.
> 
> - Most of Kudu’s command line tools have been consolidated under a new 
> top-level "kudu" tool. This reduces the number of large binaries distributed 
> with Kudu and also includes much-improved help output.
> 
> - Administrative tools including "kudu cluster ksck" now support running 
> against multi-master Kudu clusters.
> 
> - The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. 
> This can provide higher throughput for ingest workloads.
> 
> This release also includes many bug fixes, optimizations, and other 
> improvements, detailed in the release notes available at:
> http://kudu.apache.org/releases/1.0.0/docs/release_notes.html 
> 
> 
> Download the source release here:
> http://kudu.apache.org/releases/1.0.0/ 
> 
> 
> Convenience binary artifacts for the Java client and various Java 
> integrations (eg Spark, Flume) are also now available via the ASF Maven 
> repository.
> 
> Enjoy the new release!
> 
> - The Apache Kudu team

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim

Todd,

Thanks. I’ll look into those.

Cheers,
Ben


> On Sep 20, 2016, at 12:11 AM, Todd Lipcon  wrote:
> 
> The Apache Kudu team is happy to announce the release of Kudu 1.0.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds several new features, including:
> 
> - Removal of multiversion concurrency control (MVCC) history is now 
> supported. This allows Kudu to reclaim disk space, where previously Kudu 
> would keep a full history of all changes made to a given table since the 
> beginning of time.
> 
> - Most of Kudu’s command line tools have been consolidated under a new 
> top-level "kudu" tool. This reduces the number of large binaries distributed 
> with Kudu and also includes much-improved help output.
> 
> - Administrative tools including "kudu cluster ksck" now support running 
> against multi-master Kudu clusters.
> 
> - The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. 
> This can provide higher throughput for ingest workloads.
> 
> This release also includes many bug fixes, optimizations, and other 
> improvements, detailed in the release notes available at:
> http://kudu.apache.org/releases/1.0.0/docs/release_notes.html 
> 
> 
> Download the source release here:
> http://kudu.apache.org/releases/1.0.0/ 
> 
> 
> Convenience binary artifacts for the Java client and various Java 
> integrations (eg Spark, Flume) are also now available via the ASF Maven 
> repository.
> 
> Enjoy the new release!
> 
> - The Apache Kudu team

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim

This is awesome!!! Great!!!

Do you know if any improvements were also made to the Spark plugin jar?

Thanks,
Ben

> On Sep 20, 2016, at 12:11 AM, Todd Lipcon  wrote:
> 
> The Apache Kudu team is happy to announce the release of Kudu 1.0.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds several new features, including:
> 
> - Removal of multiversion concurrency control (MVCC) history is now 
> supported. This allows Kudu to reclaim disk space, where previously Kudu 
> would keep a full history of all changes made to a given table since the 
> beginning of time.
> 
> - Most of Kudu’s command line tools have been consolidated under a new 
> top-level "kudu" tool. This reduces the number of large binaries distributed 
> with Kudu and also includes much-improved help output.
> 
> - Administrative tools including "kudu cluster ksck" now support running 
> against multi-master Kudu clusters.
> 
> - The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. 
> This can provide higher throughput for ingest workloads.
> 
> This release also includes many bug fixes, optimizations, and other 
> improvements, detailed in the release notes available at:
> http://kudu.apache.org/releases/1.0.0/docs/release_notes.html 
> 
> 
> Download the source release here:
> http://kudu.apache.org/releases/1.0.0/ 
> 
> 
> Convenience binary artifacts for the Java client and various Java 
> integrations (eg Spark, Flume) are also now available via the ASF Maven 
> repository.
> 
> Enjoy the new release!
> 
> - The Apache Kudu team

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim

This is awesome!!! Great!!!

Do you know if any improvements were also made to the Spark plugin jar?

Thanks,
Ben

> On Sep 20, 2016, at 12:11 AM, Todd Lipcon  wrote:
> 
> The Apache Kudu team is happy to announce the release of Kudu 1.0.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds several new features, including:
> 
> - Removal of multiversion concurrency control (MVCC) history is now 
> supported. This allows Kudu to reclaim disk space, where previously Kudu 
> would keep a full history of all changes made to a given table since the 
> beginning of time.
> 
> - Most of Kudu’s command line tools have been consolidated under a new 
> top-level "kudu" tool. This reduces the number of large binaries distributed 
> with Kudu and also includes much-improved help output.
> 
> - Administrative tools including "kudu cluster ksck" now support running 
> against multi-master Kudu clusters.
> 
> - The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. 
> This can provide higher throughput for ingest workloads.
> 
> This release also includes many bug fixes, optimizations, and other 
> improvements, detailed in the release notes available at:
> http://kudu.apache.org/releases/1.0.0/docs/release_notes.html 
> 
> 
> Download the source release here:
> http://kudu.apache.org/releases/1.0.0/ 
> 
> 
> Convenience binary artifacts for the Java client and various Java 
> integrations (eg Spark, Flume) are also now available via the ASF Maven 
> repository.
> 
> Enjoy the new release!
> 
> - The Apache Kudu team

Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim

I am testing this in spark-shell. I am following the Spark documentation by 
simply adding the PostgreSQL driver to the Spark Classpath.

SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell

Then, I run the code below to connect to the PostgreSQL database to query. This 
is when I have problems.

Thanks,
Ben


> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
> 
> Hi! Can you split init code with current comand? I thing it is main problem 
> in your code.
> 
> 16 сент. 2016 г. 8:26 PM пользователь "Benjamin Kim"  <mailto:bbuil...@gmail.com>> написал:
> Has anyone using Spark 1.6.2 encountered very slow responses from pulling 
> data from PostgreSQL using JDBC? I can get to the table and see the schema, 
> but when I do a show, it takes very long or keeps timing out.
> 
> The code is simple.
> 
> val jdbcDF = sqlContext.read.format("jdbc").options(
> Map("url" -> 
> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>"dbtable" -> “schema.table")).load()
> 
> jdbcDF.show
> 
> If anyone can help, please let me know.
> 
> Thanks,
> Ben
>

JDBC Very Slow

2016-09-16 Thread Benjamin Kim

Has anyone using Spark 1.6.2 encountered very slow responses from pulling data 
from PostgreSQL using JDBC? I can get to the table and see the schema, but when 
I do a show, it takes very long or keeps timing out.

The code is simple.

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> 
"jdbc:postgresql://dbserver:port/database?user=user&password=password",
   "dbtable" -> “schema.table")).load()

jdbcDF.show

If anyone can help, please let me know.

Thanks,
Ben

Re: Using Spark SQL to Create JDBC Tables

2016-09-13 Thread Benjamin Kim

Thank you for the idea. I will look for a PostgreSQL Serde for Hive. But, if 
you don’t mind me asking, how did you install the Oracle Serde?

Cheers,
Ben


> On Sep 13, 2016, at 7:12 PM, ayan guha  wrote:
> 
> One option is have Hive as the central point of exposing data ie create hive 
> tables which "point to" any other DB. i know Oracle provides there own Serde 
> for hive. Not sure about PG though.
> 
> Once tables are created in hive, STS will automatically see it. 
> 
> On Wed, Sep 14, 2016 at 11:08 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Has anyone created tables using Spark SQL that directly connect to a JDBC 
> data source such as PostgreSQL? I would like to use Spark SQL Thriftserver to 
> access and query remote PostgreSQL tables. In this way, we can centralize 
> data access to Spark SQL tables along with PostgreSQL making it very 
> convenient for users. They would not know or care where the data is 
> physically located anymore.
> 
> By the way, our users only know SQL.
> 
> If anyone has a better suggestion, then please let me know too.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha

Using Spark SQL to Create JDBC Tables

2016-09-13 Thread Benjamin Kim

Has anyone created tables using Spark SQL that directly connect to a JDBC data 
source such as PostgreSQL? I would like to use Spark SQL Thriftserver to access 
and query remote PostgreSQL tables. In this way, we can centralize data access 
to Spark SQL tables along with PostgreSQL making it very convenient for users. 
They would not know or care where the data is physically located anymore.

By the way, our users only know SQL.

If anyone has a better suggestion, then please let me know too.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim

Mich,

It sounds like that there would be no harm in changing then. Are you saying 
that using STS would still use MapReduce to run the SQL statements? What our 
users are doing in our CDH 5.7.2 installation is changing the execution engine 
to Spark when connected to HiveServer2 to get faster results. Would they still 
have to do this using STS? Lastly, we are seeing zombie YARN jobs left behind 
even after a user disconnects. Are you seeing this happen with STS? If not, 
then this would be even better.

Thanks for your fast reply.

Cheers,
Ben

> On Sep 13, 2016, at 3:15 PM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> Spark Thrift server (STS) still uses hive thrift server. If you look at 
> $SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2)
> 
> function usage {
>   echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
>   pattern="usage"
>   pattern+="\|Spark assembly has been built with Hive"
>   pattern+="\|NOTE: SPARK_PREPEND_CLASSES is set"
>   pattern+="\|Spark Command: "
>   pattern+="\|==="
>   pattern+="\|--help"
> 
> 
> Indeed when you start STS, you pass hiveconf parameter to it
> 
> ${SPARK_HOME}/sbin/start-thriftserver.sh \
> --master  \
> --hiveconf hive.server2.thrift.port=10055 \
> 
> and STS bypasses Spark optimiser and uses Hive optimizer and execution 
> engine. You will see this in hive.log file
> 
> So I don't think it is going to give you much difference. Unless they have 
> recently changed the design of STS.
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 13 September 2016 at 22:32, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 
> 1.6.2 instead of HiveServer2? We are considering abandoning HiveServer2 for 
> it. Some advice and gotcha’s would be nice to know.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
>

Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim

Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 1.6.2 
instead of HiveServer2? We are considering abandoning HiveServer2 for it. Some 
advice and gotcha’s would be nice to know.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-09-06 Thread Benjamin Kim

We use Graphite/Grafana for custom metrics. We found Spark’s metrics not to be 
customizable. So, we write directly using Graphite’s API, which was very easy 
to do using Java’s socket library in Scala. It works great for us, and we are 
going one step further using Sensu to alert us if there is an anomaly in the 
metrics beyond the norm.

Hope this helps.

Cheers,
Ben


> On Sep 6, 2016, at 9:52 PM, map reduced  wrote:
> 
> Hi, anyone has any ideas please?
> 
> On Mon, Sep 5, 2016 at 8:30 PM, map reduced  > wrote:
> Hi,
> 
> I've written my custom metrics source/sink for my Spark streaming app and I 
> am trying to initialize it from metrics.properties - but that doesn't work 
> from executors. I don't have control on the machines in Spark cluster, so I 
> can't copy properties file in $SPARK_HOME/conf/ in the cluster. I have it in 
> the fat jar where my app lives, but by the time my fat jar is downloaded on 
> worker nodes in cluster, executors are already started and their Metrics 
> system is already initialized - thus not picking my file with custom source 
> configuration in it.
> 
> Following this post 
> ,
>  I've specified 'spark.files 
>  = 
> metrics.properties' and 'spark.metrics.conf=metrics.properties' but by the 
> time 'metrics.properties' is shipped to executors, their metric system is 
> already initialized.
> 
> If I initialize my own metrics system, it's picking up my file but then I'm 
> missing master/executor level metrics/properties (eg. 
> executor.sink.mySink.propName=myProp - can't read 'propName' from 'mySink') 
> since they are initialized 
> 
>  by Spark's metric system.
> 
> Is there a (programmatic) way to have 'metrics.properties' shipped before 
> executors initialize 
> 
>  ?
> 
> Here's my SO question 
> .
> 
> Thanks,
> 
> KP
> 
>

Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim

I’m using Spark 1.6 and HBase 1.2. Have you got it to work using these versions?

> On Sep 3, 2016, at 12:49 PM, Mich Talebzadeh  
> wrote:
> 
> I am trying to find a solution for this
> 
> ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
> 
> I am using Spark 2 and Hive 2!
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 3 September 2016 at 20:31, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Mich,
> 
> I’m in the same boat. We can use Hive but not Spark.
> 
> Cheers,
> Ben
> 
>> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh > <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> You can create Hive external  tables on top of existing Hbase table using 
>> the property
>> 
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> 
>> Example
>> 
>> hive> show create table hbase_table;
>> OK
>> CREATE TABLE `hbase_table`(
>>   `key` int COMMENT '',
>>   `value1` string COMMENT '',
>>   `value2` int COMMENT '',
>>   `value3` int COMMENT '')
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
>> STORED BY
>>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES (
>>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>>   'serialization.format'='1')
>> TBLPROPERTIES (
>>   'transient_lastDdlTime'='1472370939')
>> 
>>  Then try to access this Hive table from Spark which is giving me grief at 
>> the moment :(
>> 
>> scala> HiveContext.sql("use test")
>> res9: org.apache.spark.sql.DataFrame = []
>> scala> val hbase_table= spark.table("hbase_table")
>> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
>> java.lang.ClassNotFoundException Class 
>> org.apache.hadoop.hive.hbase.HBaseSerDe not found
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 September 2016 at 23:08, KhajaAsmath Mohammed > <mailto:mdkhajaasm...@gmail.com>> wrote:
>> Hi Kim,
>> 
>> I am also looking for same information. Just got the same requirement today.
>> 
>> Thanks,
>> Asmath
>> 
>> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> I was wondering if anyone has tried to create Spark SQL tables on top of 
>> HBase tables so that data in HBase can be accessed using Spark Thriftserver 
>> with SQL statements? This is similar what can be done using Hive.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
>> 
>> 
> 
>

Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim

Mich,

I’m in the same boat. We can use Hive but not Spark.

Cheers,
Ben

> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> You can create Hive external  tables on top of existing Hbase table using the 
> property
> 
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> 
> Example
> 
> hive> show create table hbase_table;
> OK
> CREATE TABLE `hbase_table`(
>   `key` int COMMENT '',
>   `value1` string COMMENT '',
>   `value2` int COMMENT '',
>   `value3` int COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
> STORED BY
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1472370939')
> 
>  Then try to access this Hive table from Spark which is giving me grief at 
> the moment :(
> 
> scala> HiveContext.sql("use test")
> res9: org.apache.spark.sql.DataFrame = []
> scala> val hbase_table= spark.table("hbase_table")
> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 September 2016 at 23:08, KhajaAsmath Mohammed  <mailto:mdkhajaasm...@gmail.com>> wrote:
> Hi Kim,
> 
> I am also looking for same information. Just got the same requirement today.
> 
> Thanks,
> Asmath
> 
> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I was wondering if anyone has tried to create Spark SQL tables on top of 
> HBase tables so that data in HBase can be accessed using Spark Thriftserver 
> with SQL statements? This is similar what can be done using Hive.
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
>

Spark SQL Tables on top of HBase Tables

2016-09-02 Thread Benjamin Kim

I was wondering if anyone has tried to create Spark SQL tables on top of HBase 
tables so that data in HBase can be accessed using Spark Thriftserver with SQL 
statements? This is similar what can be done using Hive.

Thanks,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Spark 1.6 Streaming with Checkpointing

2016-08-26 Thread Benjamin Kim

I am trying to implement checkpointing in my streaming application but I am 
getting a not serializable error. Has anyone encountered this? I am deploying 
this job in YARN clustered mode.

Here is a snippet of the main parts of the code.

object S3EventIngestion {
//create and setup streaming context
def createContext(
batchInterval: Integer, checkpointDirectory: String, awsS3BucketName: 
String, databaseName: String, tableName: String, partitionByColumnName: String
): StreamingContext = {

println("Creating new context")
val sparkConf = new SparkConf().setAppName("S3EventIngestion")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

// Create the streaming context with batch interval
val ssc = new StreamingContext(sc, Seconds(batchInterval))

// Create a text file stream on an S3 bucket
val csv = ssc.textFileStream("s3a://" + awsS3BucketName + "/")

csv.foreachRDD(rdd => {
if (!rdd.partitions.isEmpty) {
// process data
}
})

ssc.checkpoint(checkpointDirectory)
ssc
}

def main(args: Array[String]) {
if (args.length != 6) {
System.err.println("Usage: S3EventIngestion  

")
System.exit(1)
}

// Get streaming context from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpoint,
() => createContext(interval, checkpoint, bucket, database, table, 
partitionBy))

//start streaming context
context.start()
context.awaitTermination()
}
}

Can someone help please?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: abnormal high disk I/O rate when upsert into kudu table?

2016-08-16 Thread Benjamin Kim

This could be a problem… If this is a bad byproduct brought over from HBase, 
then this is a common issue for all HBase users. It would be too bad if this 
also exists in Kudu. We HBase users have been trying to eradicate this for a 
long time.

It’s only an opinion…

Cheers,
Ben


> On Aug 16, 2016, at 6:05 PM, jacky...@gmail.com wrote:
> 
> Thanks Todd.
> 
> Kudu cluster running on centos 7.2, each tablet node has 40 cores, the test 
> table is about 140GB after 3 reps,  and partitioned by hash bucket, I had 
> tried 24 and 120 hash buckets.
> 
> I do one test: 
> 1. Stop all ingestion to the cluster
> 2. Just randomly upsert 3000 rows once, upsert contains new data row or just 
> updates to exisit row (updates the whole row, not just updates one or more 
> column)
> 3. From the CDH monitor dashboard, I see the cluster's disk I/O raising from 
> ~300Mb/s to ~1.5Gb/s, and get back the ~300Mb/s 30min later or more
> 
> I check some of tablet node INFO log, they are always doing compaction, 
> compacting 1~ 100s of thousands rows.
> 
> My question:
> 1. Are the maintenance manager is rewriting the whole table?  3000 rows 
> upsert once will trigger a rewriting the whole table?
> 2. Does the background I/O have impacts to the scan performance.
> 3. About the number of hash partitioned buckets,  I partitioned the table to 
> 24 or 120 buckets, what's the difference in upsert and scan performance? and 
> what is the best practices?
> 4. What is the recommended setting for tablet server memory hard limit?
> 
> Thanks.
> 
> jacky...@gmail.com 
>  
> From: Todd Lipcon 
> Date: 2016-08-17 01:58
> To: user 
> Subject: Re: abnormal high disk I/O rate when upsert into kudu table?
> Hi Jacky,
> 
> Answers inline below
> 
> On Tue, Aug 16, 2016 at 8:13 AM, jacky...@gmail.com 
>  mailto:jacky...@gmail.com>> 
> wrote:
> Dear Kudu Developers, 
> 
> I am a new tester for kudu, our kudu cluster has 3+12 nodes, 3 seperated 
> master node and 12 tablet node, 
> each node has 128GB memory, and 1 SSD for WAL, 6 1TB SAS for data
> 
> we are using CDH 5.7.0 with impala-kudu 2.7.0 and kudu 0.9.1 parcels, we set 
> 16GB memory hard limit for each tablet node.
> 
> Sounds like a good cluster setup. Thanks for providing the details. 
> 
>  
> one of our test table is about 80-100 columns and 1 key column, with java 
> client, we can insert/upsert into the kudu table about 100,000/s
> the kudu table has 300m rows, and about 300,000 rows update per day, we also 
> use java client upsert API to update the rows
> 
> we found the kudu cluster maybe encounter abnormal high disk I/O rate, about 
> 1.5-2.0Gb/s, even we just update 1,000~10,000 rows/s
> i would like to know, with our row update frequency, is the cluster high disk 
> rate normal or not?
> 
> Are you upserts randomly spread across the range of rows in the table? If so, 
> then when the updates flush, they'll trigger compactions of the updates and 
> inserted rows into the existing data. This will cause, over time, a rewrite 
> of the whole table, in order to incorporate the updates.
> 
> This background I/O is run by the "maintenance manager". You can visit 
> http://tablet-server:8050/maintenance-manager 
>  to see a dashboard of 
> currently running maintenance operations such as compactions.
> 
> The maintenance manager runs a preset number of threads, so the amount of 
> background I/O you're experiencing won't increase if you increase the number 
> of upserts.
> 
> I'm curious, is the background I/O causing an issue, or just unexpected?
> 
> Thanks
> -Todd
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

Time Series Data

2016-08-10 Thread Benjamin Kim

Correct me if I’m wrong, but I remember a jira in the works or in the roadmap 
regarding time series data and how Kudu can better handle it.

Can someone forward me that information? I want to, at least, inform my company 
on how Kudu will be solution for this.

Thanks,
Ben

Re: Performance Question

2016-08-03 Thread Benjamin Kim

Hi Todd,

Here is an excerpt from another thread focused on Spark with Dan Burkert.

To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
not think we support that at this point.  I haven't looked deeply into it, but 
we may hit issues specifying Kudu-specific options (partitioning, column 
encoding, etc.).  Probably issues that can be worked through eventually, 
though.  If you are interested in contributing to Kudu, this is an area that 
could obviously use improvement!  Most or all of our Spark features have been 
completely community driven to date.

I am assuming that more Spark support along with semantic changes below will be 
incorporated into Kudu 0.9.1.

As a rule we do not release new features in patch releases, but the good news 
is that we are releasing regularly, and our next scheduled release is for the 
August timeframe (see JD's roadmap 
<https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
 email about what we are aiming to include).  Also, Cloudera does publish 
snapshot versions of the Spark connector here 
<https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
jars are available if you don't mind using snapshots.

Anyone know of a better way to make unique primary keys other than using UUID 
to make every row unique if there is no unique column (or combination thereof) 
to use.

Not that I know of.  In general it's pretty rare to have a dataset without a 
natural primary key (even if it's just all of the columns), but in those cases 
UUID is a good solution.

This is what I am using. I know auto incrementing is coming down the line 
(don’t know when), but is there a way to simulate this in Kudu using Spark out 
of curiosity?

To my knowledge there is no plan to have auto increment in Kudu.  Distributed, 
consistent, auto incrementing counters is a difficult problem, and I don't 
think there are any known solutions that would be fast enough for Kudu (happy 
to be proven wrong, though!).

The most important want is the first one. It helps with the non-programmers.

Thanks,
Ben

> On Jul 18, 2016, at 10:32 AM, Todd Lipcon  wrote:
> 
> On Mon, Jul 18, 2016 at 10:31 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> Thanks for the info. I was going to upgrade after the testing, but now, it 
> looks like I will have to do it earlier than expected.
> 
> I will do the upgrade, then resume.
> 
> OK, sounds good. The upgrade shouldn't invalidate any performance testing or 
> anything -- just fixes this important bug.
> 
> -Todd
> 
> 
>> On Jul 18, 2016, at 10:29 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Hi Ben,
>> 
>> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known 
>> serious bug in 0.9.0 which can cause this kind of corruption.
>> 
>> Assuming that you are running with replication count 3 this time, you should 
>> be able to move aside that tablet metadata file and start the server. It 
>> will recreate a new repaired replica automatically.
>> 
>> -Todd
>> 
>> On Mon, Jul 18, 2016 at 10:28 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> During my re-population of the Kudu table, I am getting this error trying to 
>> restart a tablet server after it went down. The job that populates this 
>> table has been running for over a week.
>> 
>> [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message 
>> of type "kudu.tablet.TabletSuperBlockPB" because it is missing required 
>> fields: rowsets[2324].columns[15].block
>> F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() 
>> Bad status: IO error: Could not init Tablet Manager: Failed to open tablet 
>> metadata for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet 
>> metadata for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load 
>> tablet metadata from 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable to 
>> parse PB from path: 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
>> *** Check failure stack trace: ***
>> @   0x7d794d  google::LogMessage::Fail()
>> @   0x7d984d  google::LogMessage::SendToLog()
>> @   0x7d7489  google::LogMessage::Flush()
>> @   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
>> @   0x78172b  (unknown)
>> @   0x344d41ed5d  (unknown)
>> @   0x7811d1  (unknown)
>> 
>> Does anyone know what this means?
>> 
>> Thanks,
>> Ben
>&g

Re: Performance Question

2016-08-01 Thread Benjamin Kim

It looks like my time is up on evaluating Kudu. To summarize, it looks very 
promising and a very likely candidate for use in production many of our use 
cases. The performance is outstanding compared to other solutions this far. 
Along with the simplicity of installation, setup, and configuration, this 
already puts many worries at ease. Here is a list of my conclusions.

Even at close to 1B rows on a 15 node cluster (24 cores, 64GB memory, 12TB 
storage), performance did degrade by at most 30% at times but mostly remained 
inline with the numbers below 80% of the time. The only problem I did encounter 
was a timeout UPSERTing data into the table once we hit >850M rows. This was 
due to a memory limit being hit. With proper configuration, this can be 
avoided. Maybe, this can be a self tuning feature based on table statistics?
The current implementation of the Spark connector is sufficient for most, but 
it still can be improved and fully featured to match other connectors for other 
data stores out there. Full Spark SQL/DataFrame capabilities would be very 
welcome. I know this will come in time.

I believe once Kudu becomes production ready, it be worth another test run. 
Hopefully, Spark support is fully supported and fully functional by then. With 
Spark 2.0 just release, I can see many cases for Kudu in conjunction with 
Structured Streaming as its data store, not only as a data destination, but as 
a spill-to for streaming (continuous) DataFrames and checkpointing job state 
and data to pick up from when/if a fatal job failure were to occur.

To all the folks on the Kudu project, thanks for all your help and work. I look 
forward to what is to come.

Cheers,
Ben

> On Jul 27, 2016, at 11:12 AM, Jean-Daniel Cryans  wrote:
> 
> Hey Ben,
> 
> I fixed a few hangs in the Java client over the past few weeks, so you might 
> be hitting that. To confirm if it's the case, set a timeout that's way 
> higher, like minutes. If it still times out, might be the hang in which case 
> there are some workarounds.
> 
> Otherwise, it might be that your cluster is getting slammed? Have you checked 
> the usuals like high iowait, swapping, etc? Also take a look at the WARNING 
> log from the tservers and see if they complain about long Write RPCs.
> 
> FWIW I've been testing non-stop inserts on a 6 nodes cluster (of which one is 
> just a master) here and I have 318B (318,852,472,816) rows inserted, 43TB on 
> disk post-replication and compression, so I'm not too worried about 800M rows 
> unless they're hundreds of KB each :P
> 
> J-D
> 
> On Tue, Jul 26, 2016 at 5:15 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I have reached over 800M rows (813,997,990), and now it’s starting to timeout 
> when UPSERTing data.
> 
> 16/07/27 00:04:58 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 17.0 
> (TID 87, prod-dc1-datanode163.pdc1i.gradientx.com 
> <http://prod-dc1-datanode163.pdc1i.gradientx.com/>): 
> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
> Deferred@159286(state=PENDING, result=null, 
> callback=org.kududb.client.AsyncKuduSession$ConvertBatchToListOfResponsesCB@154c94f8
>  -> wakeup thread Executor task launch worker-2, errback=passthrough -> 
> wakeup thread Executor task launch worker-2)
>   at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>   at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>   at org.kududb.client.KuduSession.close(KuduSession.java:110)
>   at org.kududb.spark.kudu.KuduContext.writeRows(KuduContext.scala:181)
>   at 
> org.kududb.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:131)
>   at 
> org.kududb.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:130)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 
> 
> Thanks,
> Ben
> 
> 
>> On Jul 18, 2016, at 10:32 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>&

HBase-Spark Module

2016-07-29 Thread Benjamin Kim

I would like to know if anyone has tried using the hbase-spark module? I tried 
to follow the examples in conjunction with CDH 5.8.0. I cannot find the 
HBaseTableCatalog class in the module or in any of the Spark jars. Can someone 
help?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Pass Credentials through JDBC

2016-07-28 Thread Benjamin Kim

Thank you. I’ll take a look.


> On Jul 28, 2016, at 8:16 AM, Jongyoul Lee  wrote:
> 
> You can find more information on 
> https://issues.apache.org/jira/browse/ZEPPELIN-1146 
> <https://issues.apache.org/jira/browse/ZEPPELIN-1146>
> 
> Hope this help,
> Jongyoul
> 
> On Fri, Jul 29, 2016 at 12:08 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Hi Jonyoul,
> 
> How would I enter credentials with the current version of Zeppelin? Do you 
> know of a way to make it work now?
> 
> Thanks,
> Ben
> 
>> On Jul 28, 2016, at 8:06 AM, Jongyoul Lee > <mailto:jongy...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> In my plan, this is a next step after 
>> https://issues.apache.org/jira/browse/ZEPPELIN-1210 
>> <https://issues.apache.org/jira/browse/ZEPPELIN-1210>. But for now, there's 
>> no way to pass your credentials with hiding them. I hope that would be 
>> included in 0.7.0.
>> 
>> Regards,
>> Jongyoul
>> 
>> On Thu, Jul 28, 2016 at 11:22 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> How do I pass username and password to JDBC connections such as Phoenix and 
>> Hive that are my own? Can my credentials be passed from Shiro after logging 
>> in? Or do I have to set them at the Interpreter level without sharing them? 
>> I wish there was more information on this.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> -- 
>> 이종열, Jongyoul Lee, 李宗烈
>> http://madeng.net <http://madeng.net/>
> 
> 
> 
> 
> -- 
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net <http://madeng.net/>

Re: Pass Credentials through JDBC

2016-07-28 Thread Benjamin Kim

Hi Jonyoul,

How would I enter credentials with the current version of Zeppelin? Do you know 
of a way to make it work now?

Thanks,
Ben

> On Jul 28, 2016, at 8:06 AM, Jongyoul Lee  wrote:
> 
> Hi,
> 
> In my plan, this is a next step after 
> https://issues.apache.org/jira/browse/ZEPPELIN-1210 
> <https://issues.apache.org/jira/browse/ZEPPELIN-1210>. But for now, there's 
> no way to pass your credentials with hiding them. I hope that would be 
> included in 0.7.0.
> 
> Regards,
> Jongyoul
> 
> On Thu, Jul 28, 2016 at 11:22 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> How do I pass username and password to JDBC connections such as Phoenix and 
> Hive that are my own? Can my credentials be passed from Shiro after logging 
> in? Or do I have to set them at the Interpreter level without sharing them? I 
> wish there was more information on this.
> 
> Thanks,
> Ben
> 
> 
> 
> -- 
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net <http://madeng.net/>

Pass Credentials through JDBC

2016-07-28 Thread Benjamin Kim

How do I pass username and password to JDBC connections such as Phoenix and 
Hive that are my own? Can my credentials be passed from Shiro after logging in? 
Or do I have to set them at the Interpreter level without sharing them? I wish 
there was more information on this.

Thanks,
Ben

Re: Performance Question

2016-07-27 Thread Benjamin Kim

JD,

I checked the WARNING logs and found this.

W0728 05:36:25.453966 22452 consensus_peers.cc:326] T 
377e17bb8a93493993cec74f72c2d7a5 P cb652bf9e56347beb93039802c26085f -> Peer 
da37b6f955184aa68f0cde68f85c5e03 
(prod-dc1-datanode163.pdc1i.gradientx.com:7050): Couldn't send request to peer 
da37b6f955184aa68f0cde68f85c5e03 for tablet 377e17bb8a93493993cec74f72c2d7a5. 
Status: Remote error: Service unavailable: Soft memory limit exceeded (at 
100.06% of capacity). Retrying in the next heartbeat period. Already tried 
93566 times.

In Cloudera Manager, the CGroup Soft Memory Limit is set to -1. How can I fix 
this? Is it Linux related? Also, the Kudu Tablet Server Hard Memory Limit is 
set to 4GB.

Thanks,
Ben

> On Jul 27, 2016, at 11:12 AM, Jean-Daniel Cryans  wrote:
> 
> Hey Ben,
> 
> I fixed a few hangs in the Java client over the past few weeks, so you might 
> be hitting that. To confirm if it's the case, set a timeout that's way 
> higher, like minutes. If it still times out, might be the hang in which case 
> there are some workarounds.
> 
> Otherwise, it might be that your cluster is getting slammed? Have you checked 
> the usuals like high iowait, swapping, etc? Also take a look at the WARNING 
> log from the tservers and see if they complain about long Write RPCs.
> 
> FWIW I've been testing non-stop inserts on a 6 nodes cluster (of which one is 
> just a master) here and I have 318B (318,852,472,816) rows inserted, 43TB on 
> disk post-replication and compression, so I'm not too worried about 800M rows 
> unless they're hundreds of KB each :P
> 
> J-D
> 
> On Tue, Jul 26, 2016 at 5:15 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I have reached over 800M rows (813,997,990), and now it’s starting to timeout 
> when UPSERTing data.
> 
> 16/07/27 00:04:58 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 17.0 
> (TID 87, prod-dc1-datanode163.pdc1i.gradientx.com 
> <http://prod-dc1-datanode163.pdc1i.gradientx.com/>): 
> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
> Deferred@159286(state=PENDING, result=null, 
> callback=org.kududb.client.AsyncKuduSession$ConvertBatchToListOfResponsesCB@154c94f8
>  -> wakeup thread Executor task launch worker-2, errback=passthrough -> 
> wakeup thread Executor task launch worker-2)
>   at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>   at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>   at org.kududb.client.KuduSession.close(KuduSession.java:110)
>   at org.kududb.spark.kudu.KuduContext.writeRows(KuduContext.scala:181)
>   at 
> org.kududb.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:131)
>   at 
> org.kududb.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:130)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 
> 
> Thanks,
> Ben
> 
> 
>> On Jul 18, 2016, at 10:32 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> On Mon, Jul 18, 2016 at 10:31 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> Thanks for the info. I was going to upgrade after the testing, but now, it 
>> looks like I will have to do it earlier than expected.
>> 
>> I will do the upgrade, then resume.
>> 
>> OK, sounds good. The upgrade shouldn't invalidate any performance testing or 
>> anything -- just fixes this important bug.
>> 
>> -Todd
>> 
>> 
>>> On Jul 18, 2016, at 10:29 AM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> Hi Ben,
>>> 
>>> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a 
>>> known serious bug in 0.9.0 which can cause this kind of corruption.
>>> 
>>> Assuming that you are running with replication count 3 this time,

Re: Performance Question

2016-07-26 Thread Benjamin Kim

I have reached over 800M rows (813,997,990), and now it’s starting to timeout 
when UPSERTing data.

16/07/27 00:04:58 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 17.0 
(TID 87, prod-dc1-datanode163.pdc1i.gradientx.com): 
com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
Deferred@159286(state=PENDING, result=null, 
callback=org.kududb.client.AsyncKuduSession$ConvertBatchToListOfResponsesCB@154c94f8
 -> wakeup thread Executor task launch worker-2, errback=passthrough -> wakeup 
thread Executor task launch worker-2)
at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
at org.kududb.client.KuduSession.close(KuduSession.java:110)
at org.kududb.spark.kudu.KuduContext.writeRows(KuduContext.scala:181)
at 
org.kududb.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:131)
at 
org.kududb.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:130)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Thanks,
Ben


> On Jul 18, 2016, at 10:32 AM, Todd Lipcon  wrote:
> 
> On Mon, Jul 18, 2016 at 10:31 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> Thanks for the info. I was going to upgrade after the testing, but now, it 
> looks like I will have to do it earlier than expected.
> 
> I will do the upgrade, then resume.
> 
> OK, sounds good. The upgrade shouldn't invalidate any performance testing or 
> anything -- just fixes this important bug.
> 
> -Todd
> 
> 
>> On Jul 18, 2016, at 10:29 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Hi Ben,
>> 
>> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known 
>> serious bug in 0.9.0 which can cause this kind of corruption.
>> 
>> Assuming that you are running with replication count 3 this time, you should 
>> be able to move aside that tablet metadata file and start the server. It 
>> will recreate a new repaired replica automatically.
>> 
>> -Todd
>> 
>> On Mon, Jul 18, 2016 at 10:28 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> During my re-population of the Kudu table, I am getting this error trying to 
>> restart a tablet server after it went down. The job that populates this 
>> table has been running for over a week.
>> 
>> [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message 
>> of type "kudu.tablet.TabletSuperBlockPB" because it is missing required 
>> fields: rowsets[2324].columns[15].block
>> F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() 
>> Bad status: IO error: Could not init Tablet Manager: Failed to open tablet 
>> metadata for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet 
>> metadata for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load 
>> tablet metadata from 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable to 
>> parse PB from path: 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
>> *** Check failure stack trace: ***
>> @   0x7d794d  google::LogMessage::Fail()
>> @   0x7d984d  google::LogMessage::SendToLog()
>> @   0x7d7489  google::LogMessage::Flush()
>> @   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
>>     @   0x78172b  (unknown)
>> @   0x344d41ed5d  (unknown)
>> @   0x7811d1  (unknown)
>> 
>> Does anyone know what this means?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jul 11, 2016, at 10:47 AM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Todd,
>>> 
>>> I had it

Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Benjamin Kim

It is included in Cloudera’s CDH 5.8.

> On Jul 22, 2016, at 6:13 PM, Mail.com  wrote:
> 
> Hbase Spark module will be available with Hbase 2.0. Is that out yet?
> 
>> On Jul 22, 2016, at 8:50 PM, Def_Os  wrote:
>> 
>> So it appears it should be possible to use HBase's new hbase-spark module, if
>> you follow this pattern:
>> https://hbase.apache.org/book.html#_sparksql_dataframes
>> 
>> Unfortunately, when I run my example from PySpark, I get the following
>> exception:
>> 
>> 
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o120.save.
>>> : java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource
>>> does not allow create table as select.
>>>   at scala.sys.package$.error(package.scala:27)
>>>   at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
>>>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>   at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>>>   at py4j.Gateway.invoke(Gateway.java:259)
>>>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>>>   at java.lang.Thread.run(Thread.java:745)
>> 
>> Even when I created the table in HBase first, it still failed.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27397.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Benjamin Kim

From what I read, there is no more Contexts.

"SparkContext, SQLContext, HiveContext merged into SparkSession"

I have not tested it, but I don’t know if it’s true.

Cheers,
Ben


> On Jul 18, 2016, at 8:37 AM, Koert Kuipers  wrote:
> 
> in my codebase i would like to gradually transition to SparkSession, so while 
> i start using SparkSession i also want a SQLContext to be available as before 
> (but with a deprecated warning when i use it). this should be easy since 
> SQLContext is now a wrapper for SparkSession.
> 
> so basically:
> val session = SparkSession.builder.set(..., ...).getOrCreate()
> val sqlc = new SQLContext(session)
> 
> however this doesnt work, the SQLContext constructor i am trying to use is 
> private. SparkSession.sqlContext is also private.
> 
> am i missing something?
> 
> a non-gradual switch is not very realistic in any significant codebase, and i 
> do not want to create SparkSession and SQLContext independendly (both from 
> same SparkContext) since that can only lead to confusion and inconsistent 
> settings.

Re: Performance Question

2016-07-18 Thread Benjamin Kim

Todd,

I upgraded, deleted the table and recreated it again because it was 
unaccessible, and re-introduced the downed tablet server after clearing out all 
kudu directories.

The Spark Streaming job is repopulating again.

Thanks,
Ben


> On Jul 18, 2016, at 10:32 AM, Todd Lipcon  wrote:
> 
> On Mon, Jul 18, 2016 at 10:31 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> Thanks for the info. I was going to upgrade after the testing, but now, it 
> looks like I will have to do it earlier than expected.
> 
> I will do the upgrade, then resume.
> 
> OK, sounds good. The upgrade shouldn't invalidate any performance testing or 
> anything -- just fixes this important bug.
> 
> -Todd
> 
> 
>> On Jul 18, 2016, at 10:29 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Hi Ben,
>> 
>> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known 
>> serious bug in 0.9.0 which can cause this kind of corruption.
>> 
>> Assuming that you are running with replication count 3 this time, you should 
>> be able to move aside that tablet metadata file and start the server. It 
>> will recreate a new repaired replica automatically.
>> 
>> -Todd
>> 
>> On Mon, Jul 18, 2016 at 10:28 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> During my re-population of the Kudu table, I am getting this error trying to 
>> restart a tablet server after it went down. The job that populates this 
>> table has been running for over a week.
>> 
>> [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message 
>> of type "kudu.tablet.TabletSuperBlockPB" because it is missing required 
>> fields: rowsets[2324].columns[15].block
>> F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() 
>> Bad status: IO error: Could not init Tablet Manager: Failed to open tablet 
>> metadata for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet 
>> metadata for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load 
>> tablet metadata from 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable to 
>> parse PB from path: 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
>> *** Check failure stack trace: ***
>> @   0x7d794d  google::LogMessage::Fail()
>> @   0x7d984d  google::LogMessage::SendToLog()
>> @   0x7d7489  google::LogMessage::Flush()
>> @   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
>> @   0x78172b  (unknown)
>> @   0x344d41ed5d  (unknown)
>> @   0x7811d1  (unknown)
>> 
>> Does anyone know what this means?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jul 11, 2016, at 10:47 AM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Todd,
>>> 
>>> I had it at one replica. Do I have to recreate?
>>> 
>>> We don't currently have the ability to "accept data loss" on a tablet (or 
>>> set of tablets). If the machine is gone for good, then currently the only 
>>> easy way to recover is to recreate the table. If this sounds really 
>>> painful, though, maybe we can work up some kind of tool you could use to 
>>> just recreate the missing tablets (with those rows lost).
>>> 
>>> -Todd
>>> 
>>>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon >>> <mailto:t...@cloudera.com>> wrote:
>>>> 
>>>> Hey Ben,
>>>> 
>>>> Is the table that you're querying replicated? Or was it created with only 
>>>> one replica per tablet?
>>>> 
>>>> -Todd
>>>> 
>>>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim >>> <mailto:b...@amobee.com>> wrote:
>>>> Over the weekend, a tablet server went down. It’s not coming back up. So, 
>>>> I decommissioned it and removed it from the cluster. Then, I restarted 
>>>> Kudu because I was getting a timeout  exception trying to do counts on the 
>>>> table. Now, when I try again. I get the same error.
>>>> 
>>>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>>>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>>>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>>>> com.stumb

Re: Performance Question

2016-07-18 Thread Benjamin Kim

Todd,

Thanks for the info. I was going to upgrade after the testing, but now, it 
looks like I will have to do it earlier than expected.

I will do the upgrade, then resume.

Cheers,
Ben


> On Jul 18, 2016, at 10:29 AM, Todd Lipcon  wrote:
> 
> Hi Ben,
> 
> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known 
> serious bug in 0.9.0 which can cause this kind of corruption.
> 
> Assuming that you are running with replication count 3 this time, you should 
> be able to move aside that tablet metadata file and start the server. It will 
> recreate a new repaired replica automatically.
> 
> -Todd
> 
> On Mon, Jul 18, 2016 at 10:28 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> During my re-population of the Kudu table, I am getting this error trying to 
> restart a tablet server after it went down. The job that populates this table 
> has been running for over a week.
> 
> [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message 
> of type "kudu.tablet.TabletSuperBlockPB" because it is missing required 
> fields: rowsets[2324].columns[15].block
> F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() 
> Bad status: IO error: Could not init Tablet Manager: Failed to open tablet 
> metadata for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet 
> metadata for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load 
> tablet metadata from 
> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable to 
> parse PB from path: 
> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
> *** Check failure stack trace: ***
> @   0x7d794d  google::LogMessage::Fail()
> @   0x7d984d  google::LogMessage::SendToLog()
> @   0x7d7489  google::LogMessage::Flush()
> @   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
> @   0x78172b  (unknown)
> @   0x344d41ed5d  (unknown)
> @   0x7811d1  (unknown)
> 
> Does anyone know what this means?
> 
> Thanks,
> Ben
> 
> 
>> On Jul 11, 2016, at 10:47 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> I had it at one replica. Do I have to recreate?
>> 
>> We don't currently have the ability to "accept data loss" on a tablet (or 
>> set of tablets). If the machine is gone for good, then currently the only 
>> easy way to recover is to recreate the table. If this sounds really painful, 
>> though, maybe we can work up some kind of tool you could use to just 
>> recreate the missing tablets (with those rows lost).
>> 
>> -Todd
>> 
>>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> Hey Ben,
>>> 
>>> Is the table that you're querying replicated? Or was it created with only 
>>> one replica per tablet?
>>> 
>>> -Todd
>>> 
>>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim >> <mailto:b...@amobee.com>> wrote:
>>> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
>>> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
>>> because I was getting a timeout  exception trying to do counts on the 
>>> table. Now, when I try again. I get the same error.
>>> 
>>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>>> com.stumbleupon.async.TimeoutException: Timed out after 3ms when 
>>> joining Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
>>> callback=passthrough -> scanner opened -> wakeup thread Executor task 
>>> launch worker-2, errback=openScanner errback -> passthrough -> wakeup 
>>> thread Executor task launch worker-2)
>>> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>>> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>>> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
>>> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at 
>>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>>> at 
&g

Re: Performance Question

2016-07-18 Thread Benjamin Kim

During my re-population of the Kudu table, I am getting this error trying to 
restart a tablet server after it went down. The job that populates this table 
has been running for over a week.

[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of 
type "kudu.tablet.TabletSuperBlockPB" because it is missing required fields: 
rowsets[2324].columns[15].block
F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() Bad 
status: IO error: Could not init Tablet Manager: Failed to open tablet metadata 
for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet metadata 
for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load tablet metadata 
from /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable 
to parse PB from path: 
/mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
*** Check failure stack trace: ***
@   0x7d794d  google::LogMessage::Fail()
@   0x7d984d  google::LogMessage::SendToLog()
@   0x7d7489  google::LogMessage::Flush()
@   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
@   0x78172b  (unknown)
@   0x344d41ed5d  (unknown)
@   0x7811d1  (unknown)

Does anyone know what this means?

Thanks,
Ben


> On Jul 11, 2016, at 10:47 AM, Todd Lipcon  wrote:
> 
> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I had it at one replica. Do I have to recreate?
> 
> We don't currently have the ability to "accept data loss" on a tablet (or set 
> of tablets). If the machine is gone for good, then currently the only easy 
> way to recover is to recreate the table. If this sounds really painful, 
> though, maybe we can work up some kind of tool you could use to just recreate 
> the missing tablets (with those rows lost).
> 
> -Todd
> 
>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Hey Ben,
>> 
>> Is the table that you're querying replicated? Or was it created with only 
>> one replica per tablet?
>> 
>> -Todd
>> 
>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim > <mailto:b...@amobee.com>> wrote:
>> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
>> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
>> because I was getting a timeout  exception trying to do counts on the table. 
>> Now, when I try again. I get the same error.
>> 
>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
>> Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
>> callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
>> worker-2, errback=openScanner errback -> passthrough -> wakeup thread 
>> Executor task launch worker-2)
>> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
>> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(

Re: JDBC Phoenix Authentication

2016-07-15 Thread Benjamin Kim

To follow…

I found out that the Phoenix interpreter doesn’t pass credentials. I see in the 
interpreter logs that it is using zeppelin as the user. Is there a way to pass 
credentials?

Thanks,
Ben


> On Jul 14, 2016, at 8:08 PM, Benjamin Kim  wrote:
> 
> I recently enabled simple authentication and secure authorization in HBase to 
> use LDAP for getting credentials. It works fine using HBase shell and the 
> Phoenix client to access HBase tables and data. Of course, I had to grant 
> permissions first. But now, I can’t do the same using Zeppelin’s JDBC Phoenix 
> Interpreter. I tried putting my username and password in the settings, but 
> still it doesn’t work. Does anyone how to make this work?
> 
> Here is the stack trace.
> 
> class org.apache.phoenix.exception.PhoenixIOException
> org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:108)
> org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:889)
> org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1223)
> org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:113)
> org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1937)
> org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:751)
> org.apache.phoenix.compile.CreateTableCompiler$2.execute(CreateTableCompiler.java:186)
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:320)
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:312)
> org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:310)
> org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1422)
> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1927)
> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1896)
> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1896)
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:180)
> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:132)
> org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:151)
> java.sql.DriverManager.getConnection(DriverManager.java:664)
> java.sql.DriverManager.getConnection(DriverManager.java:208)
> org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:222)
> org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
> org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:292)
> org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:396)
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
> org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
> org.apache.zeppelin.scheduler.Job.run(Job.java:176)
> org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> 
> Thanks,
> Ben
> 
>

JDBC Phoenix Authentication

2016-07-14 Thread Benjamin Kim

I recently enabled simple authentication and secure authorization in HBase to 
use LDAP for getting credentials. It works fine using HBase shell and the 
Phoenix client to access HBase tables and data. Of course, I had to grant 
permissions first. But now, I can’t do the same using Zeppelin’s JDBC Phoenix 
Interpreter. I tried putting my username and password in the settings, but 
still it doesn’t work. Does anyone how to make this work?

Here is the stack trace.

class org.apache.phoenix.exception.PhoenixIOException
org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:108)
org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:889)
org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1223)
org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:113)
org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1937)
org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:751)
org.apache.phoenix.compile.CreateTableCompiler$2.execute(CreateTableCompiler.java:186)
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:320)
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:312)
org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:310)
org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1422)
org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1927)
org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1896)
org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1896)
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:180)
org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:132)
org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:151)
java.sql.DriverManager.getConnection(DriverManager.java:664)
java.sql.DriverManager.getConnection(DriverManager.java:208)
org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:222)
org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:292)
org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:396)
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
org.apache.zeppelin.scheduler.Job.run(Job.java:176)
org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Thanks,
Ben

Re: Spark Website

2016-07-13 Thread Benjamin Kim

It takes me to the directories instead of the webpage.

> On Jul 13, 2016, at 11:45 AM, manish ranjan  wrote:
> 
> working for me. What do you mean 'as supposed to'?
> 
> ~Manish
> 
> 
> 
> On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Has anyone noticed that the spark.apache.org <http://spark.apache.org/> is 
> not working as supposed to?
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
>

Spark Website

2016-07-13 Thread Benjamin Kim

Has anyone noticed that the spark.apache.org is not working as supposed to?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Zeppelin 0.6.0 on CDH 5.7.1

2016-07-12 Thread Benjamin Kim

r now
> 
> This error seems to be serialization related. Commonly this can be caused by 
> mismatch versions. What is spark.master set to? Could you try with local[*] 
> instead of yarn-client to see if Spark running by Zeppelin is somehow 
> different?
> 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>  at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
> 
> 
> _
> From: Benjamin Kim mailto:bbuil...@gmail.com>>
> Sent: Saturday, July 9, 2016 10:54 PM
> Subject: Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released
> To: mailto:us...@zeppelin.apache.org>>
> Cc: mailto:dev@zeppelin.apache.org>>
> 
> 
> Hi JL,
> 
> Spark is version 1.6.0 and Akka is 2.2.3. But, Cloudera always back ports 
> things from newer versions. They told me that they ported some bug fixes from 
> Spark 2.0.
> 
> Please let me know if you need any more information.
> 
> Cheers,
> Ben
> 
> 
> On Jul 9, 2016, at 10:12 PM, Jongyoul Lee  <mailto:jongy...@gmail.com>> wrote:
> 
> Hi all,
> 
> Could you guys check the CDH's version of Spark? As I've tested it for a long 
> time ago, it is a little bit different from vanila one, for example, the 
> CDH's one has a different version of some depedencies including Akka.
> 
> Regards,
> JL
> 
> On Sat, Jul 9, 2016 at 11:47 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Feix,
> 
> I added hive-site.xml to the conf directory and restarted Zeppelin. Now, I 
> get another error:
> 
> java.lang.ClassNotFoundException: 
> line1631424043$24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStre

Re: Performance Question

2016-07-11 Thread Benjamin Kim

Todd,

It’s no problem to start over again. But, a tool like that would be helpful. 
Gaps in data can be accommodated for by just back filling.

Thanks,
Ben

> On Jul 11, 2016, at 10:47 AM, Todd Lipcon  wrote:
> 
> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I had it at one replica. Do I have to recreate?
> 
> We don't currently have the ability to "accept data loss" on a tablet (or set 
> of tablets). If the machine is gone for good, then currently the only easy 
> way to recover is to recreate the table. If this sounds really painful, 
> though, maybe we can work up some kind of tool you could use to just recreate 
> the missing tablets (with those rows lost).
> 
> -Todd
> 
>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Hey Ben,
>> 
>> Is the table that you're querying replicated? Or was it created with only 
>> one replica per tablet?
>> 
>> -Todd
>> 
>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim > <mailto:b...@amobee.com>> wrote:
>> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
>> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
>> because I was getting a timeout  exception trying to do counts on the table. 
>> Now, when I try again. I get the same error.
>> 
>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
>> Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
>> callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
>> worker-2, errback=openScanner errback -> passthrough -> wakeup thread 
>> Executor task launch worker-2)
>> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
>> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> Does anyone know how to recover from this?
>> 
>> Thanks,
>> Benjamin Kim
>> Data Solutions Architect
>> 
>> [a•mo•bee] (n.) the company defining digital marketing.
>> 
>> Mobile: +1 818 635 2900 
>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
>> www.amobee.com <http://www.amobee.com/>
>>> On Jul 6, 2016, at 9:46 AM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> 
>>> 
>>> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Over the weekend, the row count is up to <500M. I will give it another few 
>>> days to get to 1B rows. I still get consistent times ~15s for doing row 
>>> counts despite the amount of data growing.
>>> 
>>> On another note, I got a solicitation email from SnappyDa

Re: Performance Question

2016-07-11 Thread Benjamin Kim

Todd,

I had it at one replica. Do I have to recreate?

Thanks,
Ben


> On Jul 11, 2016, at 10:37 AM, Todd Lipcon  wrote:
> 
> Hey Ben,
> 
> Is the table that you're querying replicated? Or was it created with only one 
> replica per tablet?
> 
> -Todd
> 
> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim  <mailto:b...@amobee.com>> wrote:
> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
> because I was getting a timeout  exception trying to do counts on the table. 
> Now, when I try again. I get the same error.
> 
> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 0.0 
> (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
> Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
> callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
> worker-2, errback=openScanner errback -> passthrough -> wakeup thread 
> Executor task launch worker-2)
> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Does anyone know how to recover from this?
> 
> Thanks,
> Benjamin Kim
> Data Solutions Architect
> 
> [a•mo•bee] (n.) the company defining digital marketing.
> 
> Mobile: +1 818 635 2900 
> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  www.amobee.com 
> <http://www.amobee.com/>
>> On Jul 6, 2016, at 9:46 AM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> 
>> 
>> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Over the weekend, the row count is up to <500M. I will give it another few 
>> days to get to 1B rows. I still get consistent times ~15s for doing row 
>> counts despite the amount of data growing.
>> 
>> On another note, I got a solicitation email from SnappyData to evaluate 
>> their product. They claim to be the “Spark Data Store” with tight 
>> integration with Spark executors. It claims to be an OLTP and OLAP system 
>> with being an in-memory data store first then to disk. After going to 
>> several Spark events, it would seem that this is the new “hot” area for 
>> vendors. They all (MemSQL, Redis, Aerospike, Datastax, etc.) claim to be the 
>> best "Spark Data Store”. I’m wondering if Kudu will become this too? With 
>> the performance I’ve seen so far, it would seem that it can be a contender. 
>> All that is needed is a hardened Spark connector package, I would think. The 
>> next evaluation I will be conducting is to see if SnappyData’s claims are 
>> valid by doing my own tests.
>> 
>> It's hard to compare Kudu against any other data store without a lot of 
>> analysis and thorough benchmarking, but it is certainly a goal of Kudu to be 
>> a great platform for ingesting and analyzing data through Spark.  Up till 
>

Re: Performance Question

2016-07-11 Thread Benjamin Kim

Over the weekend, a tablet server went down. It’s not coming back up. So, I 
decommissioned it and removed it from the cluster. Then, I restarted Kudu 
because I was getting a timeout  exception trying to do counts on the table. 
Now, when I try again. I get the same error.

16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 0.0 
(TID 603, 
prod-dc1-datanode167.pdc1i.gradientx.com<http://prod-dc1-datanode167.pdc1i.gradientx.com>):
 com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
worker-2, errback=openScanner errback -> passthrough -> wakeup thread Executor 
task launch worker-2)
at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Does anyone know how to recover from this?

Thanks,
Benjamin Kim
Data Solutions Architect

[a•mo•bee] (n.) the company defining digital marketing.

Mobile: +1 818 635 2900
3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
www.amobee.com<http://www.amobee.com/>

On Jul 6, 2016, at 9:46 AM, Dan Burkert 
mailto:d...@cloudera.com>> wrote:



On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim 
mailto:bbuil...@gmail.com>> wrote:
Over the weekend, the row count is up to <500M. I will give it another few days 
to get to 1B rows. I still get consistent times ~15s for doing row counts 
despite the amount of data growing.

On another note, I got a solicitation email from SnappyData to evaluate their 
product. They claim to be the “Spark Data Store” with tight integration with 
Spark executors. It claims to be an OLTP and OLAP system with being an 
in-memory data store first then to disk. After going to several Spark events, 
it would seem that this is the new “hot” area for vendors. They all (MemSQL, 
Redis, Aerospike, Datastax, etc.) claim to be the best "Spark Data Store”. I’m 
wondering if Kudu will become this too? With the performance I’ve seen so far, 
it would seem that it can be a contender. All that is needed is a hardened 
Spark connector package, I would think. The next evaluation I will be 
conducting is to see if SnappyData’s claims are valid by doing my own tests.

It's hard to compare Kudu against any other data store without a lot of 
analysis and thorough benchmarking, but it is certainly a goal of Kudu to be a 
great platform for ingesting and analyzing data through Spark.  Up till this 
point most of the Spark work has been community driven, but more thorough 
integration testing of the Spark connector is going to be a focus going forward.

- Dan


Cheers,
Ben



On Jun 15, 2016, at 12:47 AM, Todd Lipcon 
mailto:t...@cloudera.com>> wrote:


Hi Benjamin,

What workload are you using for benchmarks? Using spark or something more 
custom? rdd or data frame or SQL, etc? Maybe you can share the schema and some 
queries

Todd

Todd

On Jun 15, 2016 8:10 AM, "Benjamin Kim" 
mailto:bbuil...@gmail.com>> wrote:
Hi Todd,

Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
Compared to HBase, read and write performance are better. Write performance has 
the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are only 
preliminary tests. Do you know of a way to really do some conclusive tests? I 
want to s

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-09 Thread Benjamin Kim

Hi JL,

Spark is version 1.6.0 and Akka is 2.2.3. But, Cloudera always back ports 
things from newer versions. They told me that they ported some bug fixes from 
Spark 2.0.

Please let me know if you need any more information.

Cheers,
Ben


> On Jul 9, 2016, at 10:12 PM, Jongyoul Lee  wrote:
> 
> Hi all,
> 
> Could you guys check the CDH's version of Spark? As I've tested it for a long 
> time ago, it is a little bit different from vanila one, for example, the 
> CDH's one has a different version of some depedencies including Akka.
> 
> Regards,
> JL
> 
> On Sat, Jul 9, 2016 at 11:47 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Feix,
> 
> I added hive-site.xml to the conf directory and restarted Zeppelin. Now, I 
> get another error:
> 
> java.lang.ClassNotFoundException: 
> line1631424043$24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputSt

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-09 Thread Benjamin Kim

(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Thanks for helping.

Ben


> On Jul 8, 2016, at 10:47 PM, Felix Cheung  wrote:
> 
> For #1, do you know if Spark can find the Hive metastore config (typically in 
> hive-site.xml) - Spark's log should indicate that.
> 
> 
> _____
> From: Benjamin Kim mailto:bbuil...@gmail.com>>
> Sent: Friday, July 8, 2016 6:53 AM
> Subject: Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released
> To: mailto:users@zeppelin.apache.org>>
> Cc: mailto:d...@zeppelin.apache.org>>
> 
> 
> Felix,
> 
> I forgot to add that I built Zeppelin from source 
> http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz 
> <http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz>
>  using this command "mvn clean package -DskipTests -Pspark-1.6 -Phadoop-2.6 
> -Dspark.version=1.6.0-cdh5.7.1 -Dhadoop.version=2.6.0-cdh5.7.1 -Ppyspark 
> -Pvendor-repo -Pbuild-distr -Dhbase.hbase.version=1.2.0-cdh5.7.1 
> -Dhbase.hadoop.version=2.6.0-cdh5.7.1”.
> 
> I did this because we are using HBase 1.2 within CDH 5.7.1.
> 
> Hope this helps clarify.
> 
> Thanks,
> Ben
> 
> 
> 
> On Jul 8, 2016, at 2:01 AM, Felix Cheung  <mailto:felixcheun...@hotmail.com>> wrote:
> 
> Is this possibly caused by CDH requiring a build-from-source instead of the 
> official binary releases?
> 
> 
> 
> 
> 
> On Thu, Jul 7, 2016 at 8:22 PM -0700, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> 
> Moon,
> 
> My environmental setup consists of an 18 node CentOS 6.7 cluster with 24 
> cores, 64GB, 12TB storage each:
> 3 of those nodes are used as Zookeeper servers, HDFS name nodes, and a YARN 
> resource manager
> 15 are for data nodes
> jdk1.8_60 and CDH 5.7.1 installed
> 
> Another node is an app server, 24 cores, 128GB memory, 1TB storage. It has 
> Zeppelin 0.6.0 and Livy 0.2.0 running on it. Plus, Hive Metastore and 
> HiveServer2, Hue, and Oozie are running on it from CDH 5.7.1.
> 
> This is our QA cluster where we are testing before deploying to production.
> 
> If you need more information, please let me know.
> 
> Thanks,
> Ben
> 
>  
> 
> On Jul 7, 2016, at 7:54 PM, moon soo Lee  <mailto:m...@apache.org>> wrote:
> 
> Randy,
> 
> Helium is not included in 0.6.0 release. Could you check which version are 
> you using?
> I created a fix for 500 errors from Helium URL in master branch. 
>

Re: Performance Question

2016-07-08 Thread Benjamin Kim

Dan,

This is good to hear as we are heavily invested in Spark as are many of our 
competitors in the AdTech/Telecom world. It would be nice to have Kudu be on 
par with the other data store technologies in terms of Spark usability, so as 
to not choose one based on “who provides it now in production”, as management 
tends to say.

Cheers,
Ben

> On Jul 6, 2016, at 9:46 AM, Dan Burkert  wrote:
> 
> 
> 
> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Over the weekend, the row count is up to <500M. I will give it another few 
> days to get to 1B rows. I still get consistent times ~15s for doing row 
> counts despite the amount of data growing.
> 
> On another note, I got a solicitation email from SnappyData to evaluate their 
> product. They claim to be the “Spark Data Store” with tight integration with 
> Spark executors. It claims to be an OLTP and OLAP system with being an 
> in-memory data store first then to disk. After going to several Spark events, 
> it would seem that this is the new “hot” area for vendors. They all (MemSQL, 
> Redis, Aerospike, Datastax, etc.) claim to be the best "Spark Data Store”. 
> I’m wondering if Kudu will become this too? With the performance I’ve seen so 
> far, it would seem that it can be a contender. All that is needed is a 
> hardened Spark connector package, I would think. The next evaluation I will 
> be conducting is to see if SnappyData’s claims are valid by doing my own 
> tests.
> 
> It's hard to compare Kudu against any other data store without a lot of 
> analysis and thorough benchmarking, but it is certainly a goal of Kudu to be 
> a great platform for ingesting and analyzing data through Spark.  Up till 
> this point most of the Spark work has been community driven, but more 
> thorough integration testing of the Spark connector is going to be a focus 
> going forward.
> 
> - Dan
> 
>  
> Cheers,
> Ben
> 
> 
> 
>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Hi Benjamin,
>> 
>> What workload are you using for benchmarks? Using spark or something more 
>> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
>> some queries
>> 
>> Todd
>> 
>> Todd
>> 
>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" > <mailto:bbuil...@gmail.com>> wrote:
>> Hi Todd,
>> 
>> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
>> Compared to HBase, read and write performance are better. Write performance 
>> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
>> only preliminary tests. Do you know of a way to really do some conclusive 
>> tests? I want to see if I can match your results on my 50 node cluster.
>> 
>> Thanks,
>> Ben
>> 
>>> On May 30, 2016, at 10:33 AM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Todd,
>>> 
>>> It sounds like Kudu can possibly top or match those numbers put out by 
>>> Aerospike. Do you have any performance statistics published or any 
>>> instructions as to measure them myself as good way to test? In addition, 
>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>>> where support will be built in?
>>> 
>>> We don't have a lot of benchmarks published yet, especially on the write 
>>> side. I've found that thorough cross-system benchmarks are very difficult 
>>> to do fairly and accurately, and often times users end up misguided if they 
>>> pay too much attention to them :) So, given a finite number of developers 
>>> working on Kudu, I think we've tended to spend more time on the project 
>>> itself and less time focusing on "competition". I'm sure there are use 
>>> cases where Kudu will beat out Aerospike, and probably use cases where 
>>> Aerospike will beat Kudu as well.
>>> 
>>> From my perspective, it would be great if you can share some details of 
>>> your workload, especially if there are some areas you're finding Kudu 
>>> lacking. Maybe we can spot some easy code changes we could make to improve 
>>> performance, or suggest a tuning variable you could change.
>>> 
>>> -Todd
>>> 
>>> 
>>>> On May 27, 2016, at 9:19 PM, Todd Lipcon >>> <mailto:t...@cloudera.com>> wrote:
>>>> 
>>>> On Fri, May 2

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-08 Thread Benjamin Kim

Felix,

I forgot to add that I built Zeppelin from source 
http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz 
<http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz> 
using this command "mvn clean package -DskipTests -Pspark-1.6 -Phadoop-2.6 
-Dspark.version=1.6.0-cdh5.7.1 -Dhadoop.version=2.6.0-cdh5.7.1 -Ppyspark 
-Pvendor-repo -Pbuild-distr -Dhbase.hbase.version=1.2.0-cdh5.7.1 
-Dhbase.hadoop.version=2.6.0-cdh5.7.1”.

I did this because we are using HBase 1.2 within CDH 5.7.1.

Hope this helps clarify.

Thanks,
Ben



> On Jul 8, 2016, at 2:01 AM, Felix Cheung  wrote:
> 
> Is this possibly caused by CDH requiring a build-from-source instead of the 
> official binary releases?
> 
> 
> 
> 
> 
> On Thu, Jul 7, 2016 at 8:22 PM -0700, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> 
> Moon,
> 
> My environmental setup consists of an 18 node CentOS 6.7 cluster with 24 
> cores, 64GB, 12TB storage each:
> 3 of those nodes are used as Zookeeper servers, HDFS name nodes, and a YARN 
> resource manager
> 15 are for data nodes
> jdk1.8_60 and CDH 5.7.1 installed
> 
> Another node is an app server, 24 cores, 128GB memory, 1TB storage. It has 
> Zeppelin 0.6.0 and Livy 0.2.0 running on it. Plus, Hive Metastore and 
> HiveServer2, Hue, and Oozie are running on it from CDH 5.7.1.
> 
> This is our QA cluster where we are testing before deploying to production.
> 
> If you need more information, please let me know.
> 
> Thanks,
> Ben
> 
>  
> 
>> On Jul 7, 2016, at 7:54 PM, moon soo Lee > <mailto:m...@apache.org>> wrote:
>> 
>> Randy,
>> 
>> Helium is not included in 0.6.0 release. Could you check which version are 
>> you using?
>> I created a fix for 500 errors from Helium URL in master branch. 
>> https://github.com/apache/zeppelin/pull/1150 
>> <https://github.com/apache/zeppelin/pull/1150>
>> 
>> Ben,
>> I can not reproduce the error, could you share how to reproduce error, or 
>> share your environment?
>> 
>> Thanks,
>> moon
>> 
>> On Thu, Jul 7, 2016 at 4:02 PM Randy Gelhausen > <mailto:rgel...@gmail.com>> wrote:
>> I don't- I hoped providing that information may help finding & fixing the 
>> problem.
>> 
>> On Thu, Jul 7, 2016 at 5:53 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Hi Randy,
>> 
>> Do you know of any way to fix it or know of a workaround?
>> 
>> Thanks,
>> Ben
>> 
>>> On Jul 7, 2016, at 2:08 PM, Randy Gelhausen >> <mailto:rgel...@gmail.com>> wrote:
>>> 
>>> HTTP 500 errors from a Helium URL
>> 
>> 
>

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim

Moon,

My environmental setup consists of an 18 node CentOS 6.7 cluster with 24 cores, 
64GB, 12TB storage each:
3 of those nodes are used as Zookeeper servers, HDFS name nodes, and a YARN 
resource manager
15 are for data nodes
jdk1.8_60 and CDH 5.7.1 installed

Another node is an app server, 24 cores, 128GB memory, 1TB storage. It has 
Zeppelin 0.6.0 and Livy 0.2.0 running on it. Plus, Hive Metastore and 
HiveServer2, Hue, and Oozie are running on it from CDH 5.7.1.

This is our QA cluster where we are testing before deploying to production.

If you need more information, please let me know.

Thanks,
Ben

 

> On Jul 7, 2016, at 7:54 PM, moon soo Lee  wrote:
> 
> Randy,
> 
> Helium is not included in 0.6.0 release. Could you check which version are 
> you using?
> I created a fix for 500 errors from Helium URL in master branch. 
> https://github.com/apache/zeppelin/pull/1150 
> <https://github.com/apache/zeppelin/pull/1150>
> 
> Ben,
> I can not reproduce the error, could you share how to reproduce error, or 
> share your environment?
> 
> Thanks,
> moon
> 
> On Thu, Jul 7, 2016 at 4:02 PM Randy Gelhausen  <mailto:rgel...@gmail.com>> wrote:
> I don't- I hoped providing that information may help finding & fixing the 
> problem.
> 
> On Thu, Jul 7, 2016 at 5:53 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Hi Randy,
> 
> Do you know of any way to fix it or know of a workaround?
> 
> Thanks,
> Ben
> 
>> On Jul 7, 2016, at 2:08 PM, Randy Gelhausen > <mailto:rgel...@gmail.com>> wrote:
>> 
>> HTTP 500 errors from a Helium URL
> 
>

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim

Hi Randy,

Do you know of any way to fix it or know of a workaround?

Thanks,
Ben

> On Jul 7, 2016, at 2:08 PM, Randy Gelhausen  wrote:
> 
> HTTP 500 errors from a Helium URL

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim

To whom it may concern:

After upgrading to Zeppelin 0.6.0, I am having a couple interpreter anomalies. 
Please look below, and I hope that there will be an easy fix for them.

1. Spark SQL gives me in this error in the Zeppelin Tutorial notebook, but the 
Scala code to populate and register the temp table runs fine.

java.lang.NullPointerException
at 
org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at 
org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:261)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:273)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:228)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:227)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:270)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:65)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at 
org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
at 
org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:43)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:231)
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:333)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.zeppelin.spark.SparkSqlInterpreter.interpret(SparkSqlInterpreter.java:117)
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at 
org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thr

Re: Shiro LDAP w/ Search Bind Authentication

2016-07-06 Thread Benjamin Kim

Rob,

I got it to work without having to use those settings. I guess Shiro gets 
around our LDAP authentication.

Thanks,
Ben


> On Jul 6, 2016, at 3:33 PM, Rob Anderson  wrote:
> 
> You can find some documentation on it here: 
> https://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/security/shiroauthentication.html
>  
> <https://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/security/shiroauthentication.html>
> 
> I believe you'll need to be running the .6 release or .7 snapshot to use 
> shiro.
> 
> We're authing against AD via ldaps calls without issue.  We're then using 
> group memberships to define roles and control access to notebooks.
> 
> Hope that helps.
> 
> Rob
> 
> 
> On Wed, Jul 6, 2016 at 2:01 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I have been trying to find documentation on how to enable LDAP 
> authentication, but I cannot find how to enter the values for these 
> configurations. This is necessary because our LDAP server is secured. Here 
> are the properties that I need to set:
> ldap_cert
> use_start_tls
> bind_dn
> bind_password
> 
> Can someone help?
> 
> Thanks,
> Ben
> 
>

Shiro LDAP w/ Search Bind Authentication

2016-07-06 Thread Benjamin Kim

I have been trying to find documentation on how to enable LDAP authentication, 
but I cannot find how to enter the values for these configurations. This is 
necessary because our LDAP server is secured. Here are the properties that I 
need to set:
ldap_cert
use_start_tls
bind_dn
bind_password

Can someone help?

Thanks,
Ben

Re: SnappyData and Structured Streaming

2016-07-06 Thread Benjamin Kim

Jags,

Thanks for the details. This makes things much clearer. I saw in the Spark 
roadmap that version 2.1 will add the SQL capabilities mentioned here. It looks 
like, gradually, the Spark community is coming to the same conclusions that the 
SnappyData folks have come to a while back in terms of Streaming. But, there is 
always the need for a better way to store data underlying Spark. The State 
Store information was informative too. I can envision that it can use this data 
store too if need be.

Thanks again,
Ben

> On Jul 6, 2016, at 8:52 AM, Jags Ramnarayan  wrote:
> 
> The plan is to fully integrate with the new structured streaming API and 
> implementation in an upcoming release. But, we will continue offering several 
> extensions. Few noted below ...
> 
> - the store (streaming sink) will offer a lot more capabilities like 
> transactions, replicated tables, partitioned row and column oriented tables 
> to suit different types of workloads. 
> - While streaming API(scala) in snappydata itself will change a bit to become 
> fully compatible with structured streaming(SchemaDStream will go away), we 
> will continue to offer SQL support for streams so they can be managed from 
> external clients (JDBC, ODBC), their partitions can share the same 
> partitioning strategy as the underlying table where it might be stored, and 
> even registrations of continuous queries from remote clients. 
> 
> While building streaming apps using the Spark APi offers tremendous 
> flexibility we also want to make it simple for apps to work with streams just 
> using SQL. For instance, you should be able to declaratively specify a table 
> as a sink to a stream(i.e. using SQL). For example, you can specify a "TopK 
> Table" (a built in special table for topK analytics using probabilistic data 
> structures) as a sink for a high velocity time series stream like this - 
> "create topK table MostPopularTweets on tweetStreamTable " +
> "options(key 'hashtag', frequencyCol 'retweets', timeSeriesColumn 
> 'tweetTime' )" 
> where 'tweetStreamTable' is created using the 'create stream table ...' SQL 
> syntax. 
> 
> 
> -
> Jags
> SnappyData blog <http://www.snappydata.io/blog>
> Download binary, source <https://github.com/SnappyDataInc/snappydata>
> 
> 
> On Wed, Jul 6, 2016 at 8:02 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Jags,
> 
> I should have been more specific. I am referring to what I read at 
> http://snappydatainc.github.io/snappydata/streamingWithSQL/ 
> <http://snappydatainc.github.io/snappydata/streamingWithSQL/>, especially the 
> Streaming Tables part. It roughly coincides with the Streaming DataFrames 
> outlined here 
> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.ff0opfdo6q1h
>  
> <https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.ff0opfdo6q1h>.
>  I don’t if I’m wrong, but they both sound very similar. That’s why I posed 
> this question.
> 
> Thanks,
> Ben
> 
>> On Jul 6, 2016, at 7:03 AM, Jags Ramnarayan > <mailto:jramnara...@snappydata.io>> wrote:
>> 
>> Ben,
>>Note that Snappydata's primary objective is to be a distributed in-memory 
>> DB for mixed workloads (i.e. streaming with transactions and analytic 
>> queries). On the other hand, Spark, till date, is primarily designed as a 
>> processing engine over myriad storage engines (SnappyData being one). So, 
>> the marriage is quite complementary. The difference compared to other stores 
>> is that SnappyData realizes its solution by deeply integrating and 
>> collocating with Spark (i.e. share spark executor memory/resources with the 
>> store) avoiding serializations and shuffle in many situations.
>> 
>> On your specific thought about being similar to Structured streaming, a 
>> better discussion could be a comparison to the recently introduced State 
>> store 
>> <https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#heading=h.2h7zw4ru3nw7>
>>  (perhaps this is what you meant). 
>> It proposes a KV store for streaming aggregations with support for updates. 
>> The proposed API will, at some point, be pluggable so vendors can easily 
>> support alternate implementations to storage, not just HDFS(default store in 
>> proposed State store). 
>> 
>> 
>> -
>> Jags
>> SnappyData blog <http://www.snappydata.io/blog>
>> Download binary, source <https://github.com/SnappyDataInc/snappydata>
>> 
>> 
>> On Wed, Jul 6, 2016

Re: SnappyData and Structured Streaming

2016-07-06 Thread Benjamin Kim

Jags,

I should have been more specific. I am referring to what I read at 
http://snappydatainc.github.io/snappydata/streamingWithSQL/, especially the 
Streaming Tables part. It roughly coincides with the Streaming DataFrames 
outlined here 
https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.ff0opfdo6q1h.
 I don’t if I’m wrong, but they both sound very similar. That’s why I posed 
this question.

Thanks,
Ben

> On Jul 6, 2016, at 7:03 AM, Jags Ramnarayan  wrote:
> 
> Ben,
>Note that Snappydata's primary objective is to be a distributed in-memory 
> DB for mixed workloads (i.e. streaming with transactions and analytic 
> queries). On the other hand, Spark, till date, is primarily designed as a 
> processing engine over myriad storage engines (SnappyData being one). So, the 
> marriage is quite complementary. The difference compared to other stores is 
> that SnappyData realizes its solution by deeply integrating and collocating 
> with Spark (i.e. share spark executor memory/resources with the store) 
> avoiding serializations and shuffle in many situations.
> 
> On your specific thought about being similar to Structured streaming, a 
> better discussion could be a comparison to the recently introduced State 
> store 
> <https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#heading=h.2h7zw4ru3nw7>
>  (perhaps this is what you meant). 
> It proposes a KV store for streaming aggregations with support for updates. 
> The proposed API will, at some point, be pluggable so vendors can easily 
> support alternate implementations to storage, not just HDFS(default store in 
> proposed State store). 
> 
> 
> -
> Jags
> SnappyData blog <http://www.snappydata.io/blog>
> Download binary, source <https://github.com/SnappyDataInc/snappydata>
> 
> 
> On Wed, Jul 6, 2016 at 12:49 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I recently got a sales email from SnappyData, and after reading the 
> documentation about what they offer, it sounds very similar to what 
> Structured Streaming will offer w/o the underlying in-memory, spill-to-disk, 
> CRUD compliant data storage in SnappyData. I was wondering if Structured 
> Streaming is trying to achieve the same on its own or is SnappyData 
> contributing Streaming extensions that they built to the Spark project. 
> Lastly, what does the Spark community think of this so-called “Spark Data 
> Store”?
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
>

Re: Performance Question

2016-07-06 Thread Benjamin Kim

Over the weekend, the row count is up to <500M. I will give it another few days 
to get to 1B rows. I still get consistent times ~15s for doing row counts 
despite the amount of data growing.

On another note, I got a solicitation email from SnappyData to evaluate their 
product. They claim to be the “Spark Data Store” with tight integration with 
Spark executors. It claims to be an OLTP and OLAP system with being an 
in-memory data store first then to disk. After going to several Spark events, 
it would seem that this is the new “hot” area for vendors. They all (MemSQL, 
Redis, Aerospike, Datastax, etc.) claim to be the best "Spark Data Store”. I’m 
wondering if Kudu will become this too? With the performance I’ve seen so far, 
it would seem that it can be a contender. All that is needed is a hardened 
Spark connector package, I would think. The next evaluation I will be 
conducting is to see if SnappyData’s claims are valid by doing my own tests.

Cheers,
Ben


> On Jun 15, 2016, at 12:47 AM, Todd Lipcon  wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Using spark or something more 
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
> some queries
> 
> Todd
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
> only preliminary tests. Do you know of a way to really do some conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published or any 
>> instructions as to measure them myself as good way to test? In addition, 
>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>> where support will be built in?
>> 
>> We don't have a lot of benchmarks published yet, especially on the write 
>> side. I've found that thorough cross-system benchmarks are very difficult to 
>> do fairly and accurately, and often times users end up misguided if they pay 
>> too much attention to them :) So, given a finite number of developers 
>> working on Kudu, I think we've tended to spend more time on the project 
>> itself and less time focusing on "competition". I'm sure there are use cases 
>> where Kudu will beat out Aerospike, and probably use cases where Aerospike 
>> will beat Kudu as well.
>> 
>> From my perspective, it would be great if you can share some details of your 
>> workload, especially if there are some areas you're finding Kudu lacking. 
>> Maybe we can spot some easy code changes we could make to improve 
>> performance, or suggest a tuning variable you could change.
>> 
>> -Todd
>> 
>> 
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Mike,
>>> 
>>> First of all, thanks for the link. It looks like an interesting read. I 
>>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>>> they are evaluating version 3.5.4. The main thing that impressed me was 
>>> their claim that they can beat Cassandra and HBase by 8x for writing and 
>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M 
>>> records per second with only 50 nodes. I wanted to see if this is real.
>>> 
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>>> depending on the size of your records and the insertion order. I've been 
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>>> better.
>>>  
>>> 
>>> To answer your questions, we have a DMP with user profiles with many 
>>> attributes. We create segmentation information off of these attributes to 
>>> classify them. Then, we ca

SnappyData and Structured Streaming

2016-07-05 Thread Benjamin Kim

I recently got a sales email from SnappyData, and after reading the 
documentation about what they offer, it sounds very similar to what Structured 
Streaming will offer w/o the underlying in-memory, spill-to-disk, CRUD 
compliant data storage in SnappyData. I was wondering if Structured Streaming 
is trying to achieve the same on its own or is SnappyData contributing 
Streaming extensions that they built to the Spark project. Lastly, what does 
the Spark community think of this so-called “Spark Data Store”?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: spark interpreter

2016-07-03 Thread Benjamin Kim

I see the download buttons.

Thanks,
Ben

On Saturday, July 2, 2016, moon soo Lee  wrote:

> Thanks for testing it.
>
> When i run 0.6.0-rc1 bin-all binary, i can see CSV, TSV download buttons.
> Could you try clear browser cache?
>
> Regarding credential menu,
> Ahyoung is working on improvement and documentation on
> https://github.com/apache/zeppelin/pull/1100.
>
> Thanks,
> moon
>
> On Fri, Jul 1, 2016 at 11:34 AM Benjamin Kim  > wrote:
>
>> Moon,
>>
>> I have downloaded and tested the bin-all tarball, and it has some
>> deficiencies compared to the build-from-source version.
>>
>>- CSV, TSV download is missing
>>- Doesn’t work with HBase 1.2 in CDH 5.7.0
>>- Spark still does not work with Spark 1.6.0 in CDH 5.7.0 (JDK8)
>>   - Using Livy is a good workaround
>>- Doesn’t work with Phoenix 4.7 in CDH 5.7.0
>>
>>
>> Everything else looks good especially in the area of multi-tenancy and
>> security. I would like to know how to use the Credentials feature on
>> securing usernames and passwords. I couldn’t find documentation on how.
>>
>> Thanks,
>> Ben
>>
>> On Jul 1, 2016, at 9:04 AM, moon soo Lee > > wrote:
>>
>> 0.6.0 is currently in vote in dev@ list.
>>
>> http://apache-zeppelin-dev-mailing-list.75694.x6.nabble.com/VOTE-Apache-Zeppelin-release-0-6-0-rc1-tp11505.html
>>
>> Thanks,
>> moon
>>
>> On Thu, Jun 30, 2016 at 1:54 PM Leon Katsnelson > > wrote:
>>
>>> What is the expected day for v0.6?
>>>
>>>
>>>
>>>
>>> From:moon soo Lee >> >
>>> To:users@zeppelin.apache.org
>>> 
>>> Date:2016/06/30 11:36 AM
>>> Subject:Re: spark interpreter
>>> ------
>>>
>>>
>>>
>>> Hi Ben,
>>>
>>> Livy interpreter is included in 0.6.0. If it is not listed when you
>>> create interpreter setting, could you check if your 'zeppelin.interpreters'
>>> property list Livy interpreter classes? (conf/zeppelin-site.xml)
>>>
>>> Thanks,
>>> moon
>>>
>>> On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim <*bbuil...@gmail.com*
>>> > wrote:
>>> On a side note…
>>>
>>> Has anyone got the Livy interpreter to be added as an interpreter in the
>>> latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on.
>>> Could this interfere?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Jun 29, 2016, at 11:18 AM, moon soo Lee <*m...@apache.org*
>>> > wrote:
>>>
>>> Livy interpreter internally creates multiple sessions for each user,
>>> independently from 3 binding modes supported in Zeppelin.
>>> Therefore, 'shared' mode, Livy interpreter will create sessions per each
>>> user, 'scoped' or 'isolated' mode will result create sessions per notebook,
>>> per user.
>>>
>>> Notebook is shared among users, they always use the same interpreter
>>> instance/process, for now. I think supporting per user interpreter
>>> instance/process would be future work.
>>>
>>> Thanks,
>>> moon
>>>
>>> On Wed, Jun 29, 2016 at 7:57 AM Chen Song <*chen.song...@gmail.com*
>>> > wrote:
>>> Thanks for your explanation, Moon.
>>>
>>> Following up on this, I can see the difference in terms of single or
>>> multiple interpreter processes.
>>>
>>> With respect to spark drivers, since each interpreter spawns a separate
>>> Spark driver in regular Spark interpreter setting, it is clear to me the
>>> different implications of the 3 binding modes.
>>>
>>> However, when it comes to Livy server with impersonation turned on, I am
>>> a bit confused. Will Livy interpreter always create a new Spark driver
>>> (along with a Spark Context instance) for each user session, regardless of
>>> the binding mode of Livy interpreter? I am not very familiar with Livy, but
>>> from what I could tell, I see no difference between different binding modes
>>> for Livy on as far as how Spark drivers are concerned.
>>>
>>> Last question, when a notebook is shared among users, will they always
>>> use the same interpreter instance/process already created?
>>>
>>> Thanks
>>> Chen
>>>
>>>
>>>
>>> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee

Re: spark interpreter

2016-07-01 Thread Benjamin Kim

Moon,

I have downloaded and tested the bin-all tarball, and it has some deficiencies 
compared to the build-from-source version.
CSV, TSV download is missing
Doesn’t work with HBase 1.2 in CDH 5.7.0
Spark still does not work with Spark 1.6.0 in CDH 5.7.0 (JDK8)
Using Livy is a good workaround
Doesn’t work with Phoenix 4.7 in CDH 5.7.0

Everything else looks good especially in the area of multi-tenancy and 
security. I would like to know how to use the Credentials feature on securing 
usernames and passwords. I couldn’t find documentation on how.

Thanks,
Ben

> On Jul 1, 2016, at 9:04 AM, moon soo Lee  wrote:
> 
> 0.6.0 is currently in vote in dev@ list.
> http://apache-zeppelin-dev-mailing-list.75694.x6.nabble.com/VOTE-Apache-Zeppelin-release-0-6-0-rc1-tp11505.html
>  
> <http://apache-zeppelin-dev-mailing-list.75694.x6.nabble.com/VOTE-Apache-Zeppelin-release-0-6-0-rc1-tp11505.html>
> 
> Thanks,
> moon
> 
> On Thu, Jun 30, 2016 at 1:54 PM Leon Katsnelson  <mailto:l...@ca.ibm.com>> wrote:
> What is the expected day for v0.6?
> 
> 
> 
> 
> From:moon soo Lee mailto:leemoon...@gmail.com>>
> To:users@zeppelin.apache.org <mailto:users@zeppelin.apache.org>
> Date:2016/06/30 11:36 AM
> Subject:Re: spark interpreter
> 
> 
> 
> Hi Ben,
> 
> Livy interpreter is included in 0.6.0. If it is not listed when you create 
> interpreter setting, could you check if your 'zeppelin.interpreters' property 
> list Livy interpreter classes? (conf/zeppelin-site.xml)
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> On a side note…
> 
> Has anyone got the Livy interpreter to be added as an interpreter in the 
> latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. 
> Could this interfere?
> 
> Thanks,
> Ben
> 
> 
> On Jun 29, 2016, at 11:18 AM, moon soo Lee  <mailto:m...@apache.org>> wrote:
> 
> Livy interpreter internally creates multiple sessions for each user, 
> independently from 3 binding modes supported in Zeppelin.
> Therefore, 'shared' mode, Livy interpreter will create sessions per each 
> user, 'scoped' or 'isolated' mode will result create sessions per notebook, 
> per user.
> 
> Notebook is shared among users, they always use the same interpreter 
> instance/process, for now. I think supporting per user interpreter 
> instance/process would be future work.
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 7:57 AM Chen Song  <mailto:chen.song...@gmail.com>> wrote:
> Thanks for your explanation, Moon.
> 
> Following up on this, I can see the difference in terms of single or multiple 
> interpreter processes. 
> 
> With respect to spark drivers, since each interpreter spawns a separate Spark 
> driver in regular Spark interpreter setting, it is clear to me the different 
> implications of the 3 binding modes.
> 
> However, when it comes to Livy server with impersonation turned on, I am a 
> bit confused. Will Livy interpreter always create a new Spark driver (along 
> with a Spark Context instance) for each user session, regardless of the 
> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
> what I could tell, I see no difference between different binding modes for 
> Livy on as far as how Spark drivers are concerned.
> 
> Last question, when a notebook is shared among users, will they always use 
> the same interpreter instance/process already created?
> 
> Thanks
> Chen
> 
> 
> 
> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee  <mailto:m...@apache.org>> wrote:
> Hi,
> 
> Thanks for asking question. It's not dumb question at all, Zeppelin docs does 
> not explain very well.
> 
> Spark Interpreter, 
> 
> 'shared' mode, a spark interpreter setting spawn a interpreter process to 
> serve all notebooks which binded to this interpreter setting.
> 'scoped' mode, a spark interpreter setting spawn multiple interpreter 
> processes per notebook which binded to this interpreter setting.
> 
> Using Livy interpreter,
> 
> Zeppelin propagate current user information to Livy interpreter. And Livy 
> interpreter creates different session per user via Livy Server.
> 
> 
> Hope this helps.
> 
> Thanks,
> moon
> 
> 
> On Tue, Jun 21, 2016 at 6:41 PM Chen Song  <mailto:chen.song...@gmail.com>> wrote:
> Zeppelin provides 3 binding modes for each interpreter. With `scoped` or 
> `shared` Spark interpreter, every user share the same SparkContext. Sorry for 
> the dumb question, how does it differ from Spark via Ivy Server?
> 
> 
> -- 
> Chen Song
> 
> 
> 
>

Re: Performance Question

2016-06-30 Thread Benjamin Kim

Hi Todd,

I changed the key to be what you suggested, and I can’t tell the difference 
since it was already fast. But, I did get more numbers.

> 104M rows in Kudu table
- read: 8s
- count: 16s
- aggregate: 9s

The time to read took much longer from 0.2s to 8s, counts were the same 16s, 
and aggregate queries look longer from 6s to 9s.

I’m still impressed.

Cheers,
Ben 

> On Jun 15, 2016, at 12:47 AM, Todd Lipcon  wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Using spark or something more 
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
> some queries
> 
> Todd
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
> only preliminary tests. Do you know of a way to really do some conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published or any 
>> instructions as to measure them myself as good way to test? In addition, 
>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>> where support will be built in?
>> 
>> We don't have a lot of benchmarks published yet, especially on the write 
>> side. I've found that thorough cross-system benchmarks are very difficult to 
>> do fairly and accurately, and often times users end up misguided if they pay 
>> too much attention to them :) So, given a finite number of developers 
>> working on Kudu, I think we've tended to spend more time on the project 
>> itself and less time focusing on "competition". I'm sure there are use cases 
>> where Kudu will beat out Aerospike, and probably use cases where Aerospike 
>> will beat Kudu as well.
>> 
>> From my perspective, it would be great if you can share some details of your 
>> workload, especially if there are some areas you're finding Kudu lacking. 
>> Maybe we can spot some easy code changes we could make to improve 
>> performance, or suggest a tuning variable you could change.
>> 
>> -Todd
>> 
>> 
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon >> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Mike,
>>> 
>>> First of all, thanks for the link. It looks like an interesting read. I 
>>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>>> they are evaluating version 3.5.4. The main thing that impressed me was 
>>> their claim that they can beat Cassandra and HBase by 8x for writing and 
>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M 
>>> records per second with only 50 nodes. I wanted to see if this is real.
>>> 
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>>> depending on the size of your records and the insertion order. I've been 
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>>> better.
>>>  
>>> 
>>> To answer your questions, we have a DMP with user profiles with many 
>>> attributes. We create segmentation information off of these attributes to 
>>> classify them. Then, we can target advertising appropriately for our sales 
>>> department. Much of the data processing is for applying models on all or if 
>>> not most of every profile’s attributes to find similarities (nearest 
>>> neighbor/clustering) over a large number of rows when batch processing or a 
>>> small subset of rows for quick online scoring. So, our use case is a 
>>> typical advanced analytics scenario. We have tried HBase, but it doesn’t 
>>> work well for these types of analytics.
>>> 
>>> I read, that Aerospike in the release notes, they

Re: spark interpreter

2016-06-30 Thread Benjamin Kim

Hi,

I fixed it. I had to restart Livy server as the hue user and not as root.

Thanks,
Ben

> On Jun 30, 2016, at 8:59 AM, Jongyoul Lee  wrote:
> 
> Hi Ben,
> 
> I suggest you stop Z, remove conf/interpreter.json, and start Z again.
> 
> Regards,
> JL
> 
> On Friday, 1 July 2016, moon soo Lee  <mailto:leemoon...@gmail.com>> wrote:
> Hi Ben,
> 
> Livy interpreter is included in 0.6.0. If it is not listed when you create 
> interpreter setting, could you check if your 'zeppelin.interpreters' property 
> list Livy interpreter classes? (conf/zeppelin-site.xml)
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim  > wrote:
> On a side note…
> 
> Has anyone got the Livy interpreter to be added as an interpreter in the 
> latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. 
> Could this interfere?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 29, 2016, at 11:18 AM, moon soo Lee > > wrote:
>> 
>> Livy interpreter internally creates multiple sessions for each user, 
>> independently from 3 binding modes supported in Zeppelin.
>> Therefore, 'shared' mode, Livy interpreter will create sessions per each 
>> user, 'scoped' or 'isolated' mode will result create sessions per notebook, 
>> per user.
>> 
>> Notebook is shared among users, they always use the same interpreter 
>> instance/process, for now. I think supporting per user interpreter 
>> instance/process would be future work.
>> 
>> Thanks,
>> moon
>> 
>> On Wed, Jun 29, 2016 at 7:57 AM Chen Song > > wrote:
>> Thanks for your explanation, Moon.
>> 
>> Following up on this, I can see the difference in terms of single or 
>> multiple interpreter processes. 
>> 
>> With respect to spark drivers, since each interpreter spawns a separate 
>> Spark driver in regular Spark interpreter setting, it is clear to me the 
>> different implications of the 3 binding modes.
>> 
>> However, when it comes to Livy server with impersonation turned on, I am a 
>> bit confused. Will Livy interpreter always create a new Spark driver (along 
>> with a Spark Context instance) for each user session, regardless of the 
>> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
>> what I could tell, I see no difference between different binding modes for 
>> Livy on as far as how Spark drivers are concerned.
>> 
>> Last question, when a notebook is shared among users, will they always use 
>> the same interpreter instance/process already created?
>> 
>> Thanks
>> Chen
>> 
>> 
>> 
>> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee > > wrote:
>> Hi,
>> 
>> Thanks for asking question. It's not dumb question at all, Zeppelin docs 
>> does not explain very well.
>> 
>> Spark Interpreter, 
>> 
>> 'shared' mode, a spark interpreter setting spawn a interpreter process to 
>> serve all notebooks which binded to this interpreter setting.
>> 'scoped' mode, a spark interpreter setting spawn multiple interpreter 
>> processes per notebook which binded to this interpreter setting.
>> 
>> Using Livy interpreter,
>> 
>> Zeppelin propagate current user information to Livy interpreter. And Livy 
>> interpreter creates different session per user via Livy Server.
>> 
>> 
>> Hope this helps.
>> 
>> Thanks,
>> moon
>> 
>> 
>> On Tue, Jun 21, 2016 at 6:41 PM Chen Song > > wrote:
>> Zeppelin provides 3 binding modes for each interpreter. With `scoped` or 
>> `shared` Spark interpreter, every user share the same SparkContext. Sorry 
>> for the dumb question, how does it differ from Spark via Ivy Server?
>> 
>> 
>> -- 
>> Chen Song
>> 
> 
> 
> 
> -- 
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net <http://madeng.net/>
>

Re: spark interpreter

2016-06-30 Thread Benjamin Kim

Moon,

That worked! There were quite a few more configuration properties added, so I 
added those too in both zeppelin-site.xml and zeppelin-env.sh. But, now, I’m 
getting errors starting a spark context.

Thanks,
Ben

> On Jun 30, 2016, at 8:10 AM, moon soo Lee  wrote:
> 
> Hi Ben,
> 
> Livy interpreter is included in 0.6.0. If it is not listed when you create 
> interpreter setting, could you check if your 'zeppelin.interpreters' property 
> list Livy interpreter classes? (conf/zeppelin-site.xml)
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> On a side note…
> 
> Has anyone got the Livy interpreter to be added as an interpreter in the 
> latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. 
> Could this interfere?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 29, 2016, at 11:18 AM, moon soo Lee > <mailto:m...@apache.org>> wrote:
>> 
>> Livy interpreter internally creates multiple sessions for each user, 
>> independently from 3 binding modes supported in Zeppelin.
>> Therefore, 'shared' mode, Livy interpreter will create sessions per each 
>> user, 'scoped' or 'isolated' mode will result create sessions per notebook, 
>> per user.
>> 
>> Notebook is shared among users, they always use the same interpreter 
>> instance/process, for now. I think supporting per user interpreter 
>> instance/process would be future work.
>> 
>> Thanks,
>> moon
>> 
>> On Wed, Jun 29, 2016 at 7:57 AM Chen Song > <mailto:chen.song...@gmail.com>> wrote:
>> Thanks for your explanation, Moon.
>> 
>> Following up on this, I can see the difference in terms of single or 
>> multiple interpreter processes. 
>> 
>> With respect to spark drivers, since each interpreter spawns a separate 
>> Spark driver in regular Spark interpreter setting, it is clear to me the 
>> different implications of the 3 binding modes.
>> 
>> However, when it comes to Livy server with impersonation turned on, I am a 
>> bit confused. Will Livy interpreter always create a new Spark driver (along 
>> with a Spark Context instance) for each user session, regardless of the 
>> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
>> what I could tell, I see no difference between different binding modes for 
>> Livy on as far as how Spark drivers are concerned.
>> 
>> Last question, when a notebook is shared among users, will they always use 
>> the same interpreter instance/process already created?
>> 
>> Thanks
>> Chen
>> 
>> 
>> 
>> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee > <mailto:m...@apache.org>> wrote:
>> Hi,
>> 
>> Thanks for asking question. It's not dumb question at all, Zeppelin docs 
>> does not explain very well.
>> 
>> Spark Interpreter, 
>> 
>> 'shared' mode, a spark interpreter setting spawn a interpreter process to 
>> serve all notebooks which binded to this interpreter setting.
>> 'scoped' mode, a spark interpreter setting spawn multiple interpreter 
>> processes per notebook which binded to this interpreter setting.
>> 
>> Using Livy interpreter,
>> 
>> Zeppelin propagate current user information to Livy interpreter. And Livy 
>> interpreter creates different session per user via Livy Server.
>> 
>> 
>> Hope this helps.
>> 
>> Thanks,
>> moon
>> 
>> 
>> On Tue, Jun 21, 2016 at 6:41 PM Chen Song > <mailto:chen.song...@gmail.com>> wrote:
>> Zeppelin provides 3 binding modes for each interpreter. With `scoped` or 
>> `shared` Spark interpreter, every user share the same SparkContext. Sorry 
>> for the dumb question, how does it differ from Spark via Ivy Server?
>> 
>> 
>> -- 
>> Chen Song
>> 
>

Re: Performance Question

2016-06-29 Thread Benjamin Kim

Todd,

FYI. The key  is unique for every row so rows are not going to already exist. 
Basically, everything is an INSERT.

val generateUUID = udf(() => UUID.randomUUID().toString)

As you can see, we are using UUID java library to create the key.

Cheers,
Ben

> On Jun 29, 2016, at 1:32 PM, Todd Lipcon  wrote:
> 
> On Wed, Jun 29, 2016 at 11:32 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I started Spark streaming more events into Kudu. Performance is great there 
> too! With HBase, it’s fast too, but I noticed that it pauses here and there, 
> making it take seconds for > 40k rows at a time, while Kudu doesn’t. The 
> progress bar just blinks by. I will keep this running until it hits 1B rows 
> and rerun my performance tests. This, hopefully, will give better numbers.
> 
> Cool! We have invested a lot of work in making Kudu have consistent 
> performance, like you mentioned. It's generally been my experience that most 
> mature ops people would prefer a system which consistently performs well 
> rather than one which has higher peak performance but occasionally stalls.
> 
> BTW, what is your row key design? One exception to the above is that, if 
> you're doing random inserts, you may see performance "fall off a cliff" once 
> the size of your key columns becomes larger than the aggregate memory size of 
> your cluster, if you're running on hard disks. Our inserts require checks for 
> duplicate keys, and that can cause random disk IOs if your keys don't fit 
> comfortably in cache. This is one area that HBase is fundamentally going to 
> be faster based on its design.
> 
> -Todd
> 
> 
>> On Jun 28, 2016, at 4:26 PM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Cool, thanks for the report, Ben. For what it's worth, I think there's still 
>> some low hanging fruit in the Spark connector for Kudu (for example, I 
>> believe locality on reads is currently broken). So, you can expect 
>> performance to continue to improve in future versions. I'd also be 
>> interested to see results on Kudu for a much larger dataset - my guess is a 
>> lot of the 6 seconds you're seeing is constant overhead from Spark job 
>> setup, etc, given that the performance doesn't seem to get slower as you 
>> went from 700K rows to 13M rows.
>> 
>> -Todd
>> 
>> On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> FYI.
>> 
>> I did a quick-n-dirty performance test.
>> 
>> First, the setup:
>> QA cluster:
>> 15 data nodes
>> 64GB memory each
>> HBase is using 4GB of memory
>> Kudu is using 1GB of memory
>> 1 HBase/Kudu master node
>> 64GB memory
>> HBase/Kudu master is using 1GB of memory each
>> 10Gb Ethernet
>> 
>> Using Spark on both to load/read events data (84 columns per row), I was 
>> able to record performance for each. On the HBase side, I used the Phoenix 
>> 4.7 Spark plugin where DataFrames can be used directly. On the Kudu side, I 
>> used the Spark connector. I created an events table in Phoenix using the 
>> CREATE TABLE statement and created the equivalent in Kudu using the Spark 
>> method based off of a DataFrame schema.
>> 
>> Here are the numbers for Phoenix/HBase.
>> 1st run:
>> > 715k rows
>> - write: 2.7m
>> 
>> > 715k rows in HBase table
>> - read: 0.1s
>> - count: 3.8s
>> - aggregate: 61s
>> 
>> 2nd run:
>> > 5.2M rows
>> - write: 11m
>> * had 4 region servers go down, had to retry the 5.2M row write
>> 
>> > 5.9M rows in HBase table
>> - read: 8s
>> - count: 3m
>> - aggregate: 46s
>> 
>> 3rd run:
>> > 6.8M rows
>> - write: 9.6m
>> 
>> > 12.7M rows
>> - read: 10s
>> - count: 3m
>> - aggregate: 44s
>> 
>> 
>> Here are the numbers for Kudu.
>> 1st run:
>> > 715k rows
>> - write: 18s
>> 
>> > 715k rows in Kudu table
>> - read: 0.2s
>> - count: 18s
>> - aggregate: 5s
>> 
>> 2nd run:
>> > 5.2M rows
>> - write: 33s
>> 
>> > 5.9M rows in Kudu table
>> - read: 0.2s
>> - count: 16s
>> - aggregate: 6s
>> 
>> 3rd run:
>> > 6.8M rows
>> - write: 27s
>> 
>> > 12.7M rows in Kudu table
>> - read: 0.2s
>> - count: 16s
>> - aggregate: 6s
>> 
>> The Kudu results are impressive if you take these number as-is. Kudu is 
>> close to 18x faster at writing (UPSERT). Kudu is 30x faster at readi

Re: spark interpreter

2016-06-29 Thread Benjamin Kim

On a side note…

Has anyone got the Livy interpreter to be added as an interpreter in the latest 
build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. Could this 
interfere?

Thanks,
Ben


> On Jun 29, 2016, at 11:18 AM, moon soo Lee  wrote:
> 
> Livy interpreter internally creates multiple sessions for each user, 
> independently from 3 binding modes supported in Zeppelin.
> Therefore, 'shared' mode, Livy interpreter will create sessions per each 
> user, 'scoped' or 'isolated' mode will result create sessions per notebook, 
> per user.
> 
> Notebook is shared among users, they always use the same interpreter 
> instance/process, for now. I think supporting per user interpreter 
> instance/process would be future work.
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 7:57 AM Chen Song  > wrote:
> Thanks for your explanation, Moon.
> 
> Following up on this, I can see the difference in terms of single or multiple 
> interpreter processes. 
> 
> With respect to spark drivers, since each interpreter spawns a separate Spark 
> driver in regular Spark interpreter setting, it is clear to me the different 
> implications of the 3 binding modes.
> 
> However, when it comes to Livy server with impersonation turned on, I am a 
> bit confused. Will Livy interpreter always create a new Spark driver (along 
> with a Spark Context instance) for each user session, regardless of the 
> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
> what I could tell, I see no difference between different binding modes for 
> Livy on as far as how Spark drivers are concerned.
> 
> Last question, when a notebook is shared among users, will they always use 
> the same interpreter instance/process already created?
> 
> Thanks
> Chen
> 
> 
> 
> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee  > wrote:
> Hi,
> 
> Thanks for asking question. It's not dumb question at all, Zeppelin docs does 
> not explain very well.
> 
> Spark Interpreter, 
> 
> 'shared' mode, a spark interpreter setting spawn a interpreter process to 
> serve all notebooks which binded to this interpreter setting.
> 'scoped' mode, a spark interpreter setting spawn multiple interpreter 
> processes per notebook which binded to this interpreter setting.
> 
> Using Livy interpreter,
> 
> Zeppelin propagate current user information to Livy interpreter. And Livy 
> interpreter creates different session per user via Livy Server.
> 
> 
> Hope this helps.
> 
> Thanks,
> moon
> 
> 
> On Tue, Jun 21, 2016 at 6:41 PM Chen Song  > wrote:
> Zeppelin provides 3 binding modes for each interpreter. With `scoped` or 
> `shared` Spark interpreter, every user share the same SparkContext. Sorry for 
> the dumb question, how does it differ from Spark via Ivy Server?
> 
> 
> -- 
> Chen Song
>

Re: Performance Question

2016-06-29 Thread Benjamin Kim

Todd,

I started Spark streaming more events into Kudu. Performance is great there 
too! With HBase, it’s fast too, but I noticed that it pauses here and there, 
making it take seconds for > 40k rows at a time, while Kudu doesn’t. The 
progress bar just blinks by. I will keep this running until it hits 1B rows and 
rerun my performance tests. This, hopefully, will give better numbers.

Thanks,
Ben


> On Jun 28, 2016, at 4:26 PM, Todd Lipcon  wrote:
> 
> Cool, thanks for the report, Ben. For what it's worth, I think there's still 
> some low hanging fruit in the Spark connector for Kudu (for example, I 
> believe locality on reads is currently broken). So, you can expect 
> performance to continue to improve in future versions. I'd also be interested 
> to see results on Kudu for a much larger dataset - my guess is a lot of the 6 
> seconds you're seeing is constant overhead from Spark job setup, etc, given 
> that the performance doesn't seem to get slower as you went from 700K rows to 
> 13M rows.
> 
> -Todd
> 
> On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> FYI.
> 
> I did a quick-n-dirty performance test.
> 
> First, the setup:
> QA cluster:
> 15 data nodes
> 64GB memory each
> HBase is using 4GB of memory
> Kudu is using 1GB of memory
> 1 HBase/Kudu master node
> 64GB memory
> HBase/Kudu master is using 1GB of memory each
> 10Gb Ethernet
> 
> Using Spark on both to load/read events data (84 columns per row), I was able 
> to record performance for each. On the HBase side, I used the Phoenix 4.7 
> Spark plugin where DataFrames can be used directly. On the Kudu side, I used 
> the Spark connector. I created an events table in Phoenix using the CREATE 
> TABLE statement and created the equivalent in Kudu using the Spark method 
> based off of a DataFrame schema.
> 
> Here are the numbers for Phoenix/HBase.
> 1st run:
> > 715k rows
> - write: 2.7m
> 
> > 715k rows in HBase table
> - read: 0.1s
> - count: 3.8s
> - aggregate: 61s
> 
> 2nd run:
> > 5.2M rows
> - write: 11m
> * had 4 region servers go down, had to retry the 5.2M row write
> 
> > 5.9M rows in HBase table
> - read: 8s
> - count: 3m
> - aggregate: 46s
> 
> 3rd run:
> > 6.8M rows
> - write: 9.6m
> 
> > 12.7M rows
> - read: 10s
> - count: 3m
> - aggregate: 44s
> 
> 
> Here are the numbers for Kudu.
> 1st run:
> > 715k rows
> - write: 18s
> 
> > 715k rows in Kudu table
> - read: 0.2s
> - count: 18s
> - aggregate: 5s
> 
> 2nd run:
> > 5.2M rows
> - write: 33s
> 
> > 5.9M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
> 
> 3rd run:
> > 6.8M rows
> - write: 27s
> 
> > 12.7M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
> 
> The Kudu results are impressive if you take these number as-is. Kudu is close 
> to 18x faster at writing (UPSERT). Kudu is 30x faster at reading (HBase times 
> increase as data size grows).  Kudu is 7x faster at full row counts. Lastly, 
> Kudu is 3x faster doing an aggregate query (count distinct event_id’s per 
> user_id). *Remember that this is small cluster, times are still respectable 
> for both systems, HBase could have been configured better, and the HBase 
> table could have been better tuned.
> 
> Cheers,
> Ben
> 
> 
>> On Jun 15, 2016, at 10:13 AM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> Adding partition splits when range partitioning is done via the 
>> CreateTableOptions.addSplitRow 
>> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
>>  method.  You can find more about the different partitioning options in the 
>> schema design guide 
>> <http://getkudu.io/docs/schema_design.html#data-distribution>.  We generally 
>> recommend sticking to hash partitioning if possible, since you don't have to 
>> determine your own split rows.
>> 
>> - Dan
>> 
>> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> I think the locality is not within our setup. We have the compute cluster 
>> with Spark, YARN, etc. on its own, and we have the storage cluster with 
>> HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute 
>> cluster and beefed up storage capacity on the storage cluster. We got this 
>> setup idea from the Databricks folks. I do have a question. I created the 
>> table to use range partition on columns. I see that if I use hash partition

Kudu Connector

2016-06-29 Thread Benjamin Kim

I was wondering if anyone, who is a Spark Scala developer, would be willing to 
continue the work done for the Kudu connector?

https://github.com/apache/incubator-kudu/tree/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu

I have been testing and using Kudu for the past month and comparing against 
HBase. It seems like a promising data store to complement Spark. It fills the 
gap in our company as a fast updatable data store. We stream GB’s of data in 
and run analytical queries against it, which run in well below a minute 
typically. According to the Kudu users group, all it needs is to add SQL (JDBC) 
friendly features (CREATE TABLE, intuitive save modes (append = upsert and 
overwrite = truncate + insert), DELETE, etc.) and improve performance by 
implementing locality.

For reference, here is the page on contributing.

http://kudu.apache.org/docs/contributing.html

I am hoping that for individuals in the Spark community it would be relatively 
easy.

Thanks!
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Performance Question

2016-06-28 Thread Benjamin Kim

FYI.

I did a quick-n-dirty performance test.

First, the setup:
QA cluster:
15 data nodes
64GB memory each
HBase is using 4GB of memory
Kudu is using 1GB of memory
1 HBase/Kudu master node
64GB memory
HBase/Kudu master is using 1GB of memory each
10Gb Ethernet

Using Spark on both to load/read events data (84 columns per row), I was able 
to record performance for each. On the HBase side, I used the Phoenix 4.7 Spark 
plugin where DataFrames can be used directly. On the Kudu side, I used the 
Spark connector. I created an events table in Phoenix using the CREATE TABLE 
statement and created the equivalent in Kudu using the Spark method based off 
of a DataFrame schema.

Here are the numbers for Phoenix/HBase.
1st run:
> 715k rows
- write: 2.7m

> 715k rows in HBase table
- read: 0.1s
- count: 3.8s
- aggregate: 61s

2nd run:
> 5.2M rows
- write: 11m
* had 4 region servers go down, had to retry the 5.2M row write

> 5.9M rows in HBase table
- read: 8s
- count: 3m
- aggregate: 46s

3rd run:
> 6.8M rows
- write: 9.6m

> 12.7M rows
- read: 10s
- count: 3m
- aggregate: 44s


Here are the numbers for Kudu.
1st run:
> 715k rows
- write: 18s

> 715k rows in Kudu table
- read: 0.2s
- count: 18s
- aggregate: 5s

2nd run:
> 5.2M rows
- write: 33s

> 5.9M rows in Kudu table
- read: 0.2s
- count: 16s
- aggregate: 6s

3rd run:
> 6.8M rows
- write: 27s

> 12.7M rows in Kudu table
- read: 0.2s
- count: 16s
- aggregate: 6s

The Kudu results are impressive if you take these number as-is. Kudu is close 
to 18x faster at writing (UPSERT). Kudu is 30x faster at reading (HBase times 
increase as data size grows).  Kudu is 7x faster at full row counts. Lastly, 
Kudu is 3x faster doing an aggregate query (count distinct event_id’s per 
user_id). *Remember that this is small cluster, times are still respectable for 
both systems, HBase could have been configured better, and the HBase table 
could have been better tuned.

Cheers,
Ben


> On Jun 15, 2016, at 10:13 AM, Dan Burkert  wrote:
> 
> Adding partition splits when range partitioning is done via the 
> CreateTableOptions.addSplitRow 
> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
>  method.  You can find more about the different partitioning options in the 
> schema design guide 
> <http://getkudu.io/docs/schema_design.html#data-distribution>.  We generally 
> recommend sticking to hash partitioning if possible, since you don't have to 
> determine your own split rows.
> 
> - Dan
> 
> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I think the locality is not within our setup. We have the compute cluster 
> with Spark, YARN, etc. on its own, and we have the storage cluster with 
> HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute 
> cluster and beefed up storage capacity on the storage cluster. We got this 
> setup idea from the Databricks folks. I do have a question. I created the 
> table to use range partition on columns. I see that if I use hash partition I 
> can set the number of splits, but how do I do that using range (50 nodes * 10 
> = 500 splits)?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 15, 2016, at 9:11 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> Awesome use case. One thing to keep in mind is that spark parallelism will 
>> be limited by the number of tablets. So, you might want to split into 10 or 
>> so buckets per node to get the best query throughput.
>> 
>> Usually if you run top on some machines while running the query you can see 
>> if it is fully utilizing the cores.
>> 
>> Another known issue right now is that spark locality isn't working properly 
>> on replicated tables so you will use a lot of network traffic. For a perf 
>> test you might want to try a table with replication count 1
>> 
>> On Jun 15, 2016 5:26 PM, "Benjamin Kim" > <mailto:bbuil...@gmail.com>> wrote:
>> Hi Todd,
>> 
>> I did a simple test of our ad events. We stream using Spark Streaming 
>> directly into HBase, and the Data Analysts/Scientists do some 
>> insight/discovery work plus some reports generation. For the reports, we use 
>> SQL, and the more deeper stuff, we use Spark. In Spark, our main data 
>> currency store of choice is DataFrames.
>> 
>> The schema is around 83 columns wide where most are of the string data type.
>> 
>> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
>> "user_id", "mappable_id",
>> "cookie_status", "profile_status", "user_status&qu

Re: Spark 1.6 (CDH 5.7) and Phoenix 4.7 (CLABS)

2016-06-27 Thread Benjamin Kim

Hi Sean,

I figured out the problem. By putting these jars in the Spark classpath.txt 
file located in Spark conf, this allowed for these to be loaded first. This 
fixed it!

Thanks,
Ben


> On Jun 27, 2016, at 4:20 PM, Sean Busbey  wrote:
> 
> Hi Ben!
> 
> For problems with the Cloudera Labs packaging of Apache Phoenix, you should 
> first seek help on the vendor-specific community forums, to ensure the issue 
> isn't specific to the vendor:
> 
> http://community.cloudera.com/t5/Cloudera-Labs/bd-p/ClouderaLabs
> 
> -busbey
> 
> On 2016-06-27 15:27 (-0500), Benjamin Kim  wrote: 
>> Anyone tried to save a DataFrame to a HBase table using Phoenix? I am able 
>> to load and read, but I canâ?Tt save.
>> 
>>>> spark-shell â?"jars 
>>>> /opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/lib/phoenix-spark-4.7.0-clabs-phoenix1.3.0.jar,/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-4.7.0-clabs-phoenix1.3.0-client.jar
>> 
>> import org.apache.spark.sql._
>> import org.apache.phoenix.spark._
>> 
>> val hbaseConnectionString = â?oâ?
>> 
>> // Save to OUTPUT_TABLE
>> df.save("org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> 
>> "OUTPUT_TABLE",
>>  "zkUrl" -> hbaseConnectionString))
>> 
>> java.lang.ClassNotFoundException: Class 
>> org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
>>  at 
>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
>>  at 
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
>> 
>> Thanks,
>> Ben

Spark 1.6 (CDH 5.7) and Phoenix 4.7 (CLABS)

2016-06-27 Thread Benjamin Kim

Anyone tried to save a DataFrame to a HBase table using Phoenix? I am able to 
load and read, but I can’t save.

>> spark-shell —jars 
>> /opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/lib/phoenix-spark-4.7.0-clabs-phoenix1.3.0.jar,/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-4.7.0-clabs-phoenix1.3.0-client.jar

import org.apache.spark.sql._
import org.apache.phoenix.spark._

val hbaseConnectionString = “”

// Save to OUTPUT_TABLE
df.save("org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> 
"OUTPUT_TABLE",
  "zkUrl" -> hbaseConnectionString))

java.lang.ClassNotFoundException: Class 
org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)

Thanks,
Ben

livy interpreter not appearing

2016-06-26 Thread Benjamin Kim

Has anyone tried using the livy interpreter? I cannot add it. It just does not 
appear after clicking save.

Thanks,
Ben

Re: phoenix on non-apache hbase

2016-06-25 Thread Benjamin Kim

What a surprise! I see that the phoenix 4.7.0-1.clabs_phoenix1.3.0.p0.000 
parcel has been released by Cloudera. Now, starts the usage tests.

Cheers,
Ben

> On Jun 9, 2016, at 10:09 PM, Ankur Jain  wrote:
> 
> I have updated my jira with updated instructions 
> https://issues.apache.org/jira/browse/PHOENIX-2834 
> <https://issues.apache.org/jira/browse/PHOENIX-2834>.
> 
> Please do let me know if you are able to build and use with CDH5.7
> 
> Thanks,
> Ankur Jain
> 
> From: Andrew Purtell  <mailto:andrew.purt...@gmail.com>>
> Reply-To: "user@phoenix.apache.org <mailto:user@phoenix.apache.org>" 
> mailto:user@phoenix.apache.org>>
> Date: Friday, 10 June 2016 at 9:06 AM
> To: "user@phoenix.apache.org <mailto:user@phoenix.apache.org>" 
> mailto:user@phoenix.apache.org>>
> Subject: Re: phoenix on non-apache hbase
> 
> Yes a stock client should work with a server modified for CDH assuming both 
> client and server versions are within the bounds specified by the backwards 
> compatibility policy (https://phoenix.apache.org/upgrading.html 
> <https://phoenix.apache.org/upgrading.html>)
> 
> "Phoenix maintains backward compatibility across at least two minor releases 
> to allow for no downtime through server-side rolling restarts upon upgrading."
> 
> 
> On Jun 9, 2016, at 8:09 PM, Koert Kuipers  <mailto:ko...@tresata.com>> wrote:
> 
>> is phoenix client also affect by this? or does phoenix server isolate the 
>> client? 
>> 
>> is it reasonable to expect a "stock" phoenix client to work against a custom 
>> phoenix server for cdh 5.x? (with of course the phoenix client and server 
>> having same phoenix version).
>> 
>> 
>> 
>> On Thu, Jun 9, 2016 at 10:55 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>>> Andrew,
>>> 
>>> Since we are still on CDH 5.5.2, can I just use your custom version? 
>>> Phoenix is one of the reasons that we are blocked from upgrading to CDH 
>>> 5.7.1. Thus, CDH 5.7.1 is only on our test cluster. One of our developers 
>>> wants to try out the Phoenix Spark plugin. Did you try it out in yours too? 
>>> Does it work if you did?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 9, 2016, at 7:47 PM, Andrew Purtell >>> <mailto:andrew.purt...@gmail.com>> wrote:
>>>> 
>>>> >  is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache HBase 
>>>> > 1.2.0?
>>>> 
>>>> Yes
>>>> 
>>>> As is the Cloudera HBase in 5.6, 5.5, 5.4, ... quite different from Apache 
>>>> HBase in coprocessor and RPC internal extension APIs. 
>>>> 
>>>> We have made some ports of Apache Phoenix releases to CDH here: 
>>>> https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5
>>>>  
>>>> <https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5?files=1>
>>>>  
>>>> 
>>>> It's a personal project of mine, not something supported by the community. 
>>>> Sounds like I should look at what to do with CDH 5.6 and 5.7. 
>>>> 
>>>> On Jun 9, 2016, at 7:37 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>>> This interests me too. I asked Cloudera in their community forums a while 
>>>>> back but got no answer on this. I hope they don’t leave us out in the 
>>>>> cold. I tried building it too before with the instructions here 
>>>>> https://issues.apache.org/jira/browse/PHOENIX-2834 
>>>>> <https://issues.apache.org/jira/browse/PHOENIX-2834>. I could get it to 
>>>>> build, but I couldn’t get it to work using the Phoenix installation 
>>>>> instructions. For some reason, dropping the server jar into CDH 5.7.0 
>>>>> HBase lib directory didn’t change things. HBase seemed not to use it. Now 
>>>>> that this is out, I’ll give it another try hoping that there is a way. If 
>>>>> anyone has any leads to help, please let me know.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Jun 9, 2016, at 6:39 PM, Josh Elser >>>>> <mailto:josh.el...@gmail.com>> wrote:
>>>>>> 
>>>>>> Koert,
>>>>>> 
>>>>>> Apache Phoenix goes through a lot of work to provide mul

Model Quality Tracking

2016-06-24 Thread Benjamin Kim

Has anyone implemented a way to track the performance of a data model? We 
currently have an algorithm to do record linkage and spit out statistics of 
matches, non-matches, and/or partial matches with reason codes of why we didn’t 
match accurately. In this way, we will know if something goes wrong down the 
line. All of this goes into a csv file directories partitioned by datetime with 
a hive table on top. Then, we can do analytical queries and even charting if 
need be. All of this is very manual, but I was wondering if there is a package, 
software, built-in module, etc. that would do this automatically. Since we are 
using CDH, it would be great if these graphs could be integrated into Cloudera 
Manager too.

Any advice is welcome.

Thanks,
Ben


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: admin account in zeppelin

2016-06-21 Thread Benjamin Kim

Chen,

If you don’t mind, how did you integrate LDAP with Zeppelin. As far as I know, 
Shiro was a manual way to setup users and security.

Thanks,
Ben

> On Jun 21, 2016, at 2:44 PM, Chen Song  wrote:
> 
> I am new to Zeppelin and have successfully set up LDAP authentication on 
> zeppelin.
> 
> I also want to restrict write access to interpreters, credentials and 
> configurations to only admin users.
> 
> I added the configurations as per https://github.com/apache/zeppelin/pull/993 
>  and it does hide edit access 
> from other users. However, when I logged in as myUsername, which is supposed 
> to be an admin user, I could edit those 3 things either. Is there anything I 
> miss?
> 
> [users]
> admin = myUsername
> 
> [urls]
> api/version = anon
> /api/interpreter/** = authc, roles[admin]
> /api/configurations/** = authc, roles[admin]
> /api/credential/** = authc, roles[admin]
> 
> Thanks for your feedback.
> 
> -- 
> Chen Song
>

Re: Spark on Kudu

2016-06-20 Thread Benjamin Kim

Dan,

Out of curiosity, I was looking through the spark-csv code in Github and tried 
to see what makes it work for the “CREATE TABLE” statement, while it doesn’t 
for spark-kudu. There are differences in the way both are done, CsvRelation vs. 
KuduRelation. I’m still learning how this works though and what implications 
these differences are. In your opinion, is this the right place to start?

Thanks,
Ben


> On Jun 17, 2016, at 11:08 AM, Dan Burkert  wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news 
> is that we are releasing regularly, and our next scheduled release is for the 
> August timeframe (see JD's roadmap 
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>  email about what we are aiming to include).  Also, Cloudera does publish 
> snapshot versions of the Spark connector here 
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
> jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase because this is what our use case 
>>> is.
>>> 
>>> Thanks

Re: Github Integration

2016-06-17 Thread Benjamin Kim

I will try this. Do you have any information on how to do the checkpoints? Is 
there any additional setups?

Thanks,
Ben

> On Jun 9, 2016, at 2:06 PM, Khalid Huseynov  wrote:
> 
> As it was mentioned, existing "git" repository is local and setup is 
> described here 
> <https://zeppelin.incubator.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#Git>.
>  
> Once setup, you can also do checkpoints (commits) from Zeppelin version 
> control menu with your commit message.
> 
> On Thu, Jun 9, 2016 at 9:17 AM, Jeff Steinmetz  <mailto:jeffrey.steinm...@gmail.com>> wrote:
> I believe it actually uses a local “git” repository, not necessarily “github”
> If you want it to sync to origin (stash), you could set up a `git push` cron 
> job on a schedule.
> 
> 
> 
> On 6/9/16, 8:40 AM, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> 
> >I heard that Zeppelin 0.6.0 is able to use its local notebook directory as a 
> >Github repo. Does anyone know of a way to have it work (workaround) with our 
> >company’s Github (Stash) repo server?
> >
> >Any advice would be welcome.
> >
> >Thanks,
> >Ben
> 
>

Data Integrity / Model Quality Monitoring

2016-06-17 Thread Benjamin Kim

Has anyone run into this requirement?

We have a need to track data integrity and model quality metrics of outcomes so 
that we can both gauge if the data is healthy coming in and the models run 
against them are still performing and not giving faulty results. A nice to have 
would be to graph these over time somehow. Since we are using Cloudera Manager, 
graphing in there would be a plus.

Any advice or suggestions would be welcome.

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark on Kudu

2016-06-17 Thread Benjamin Kim

Dan,

The roadmap is very informative. I am looking forward to the official 1.0 
release! It would be so much easier for us to use in every aspect compared to 
HBase.

Cheers,
Ben


> On Jun 17, 2016, at 11:08 AM, Dan Burkert  wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news 
> is that we are releasing regularly, and our next scheduled release is for the 
> August timeframe (see JD's roadmap 
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>  email about what we are aiming to include).  Also, Cloudera does publish 
> snapshot versions of the Spark connector here 
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
> jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase because this is what our use case 
>>> is.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> 
>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>> Hi,
>>>>

Re: Spark on Kudu

2016-06-17 Thread Benjamin Kim

I am assuming that more Spark support along with semantic changes below will be 
incorporated into Kudu 0.9.1.

Anyone know of a better way to make unique primary keys other than using UUID 
to make every row unique if there is no unique column (or combination thereof) 
to use.

import java.util.UUID
val generateUUID = udf(() => UUID.randomUUID().toString)

This is what I am using. I know auto incrementing is coming down the line 
(don’t know when), but is there a way to simulate this in Kudu using Spark out 
of curiosity?

Thanks,
Ben

> On Jun 14, 2016, at 6:08 PM, Dan Burkert  wrote:
> 
> I'm not sure exactly what the semantics will be, but at least one of them 
> will be upsert.  These modes come from spark, and they were really designed 
> for file-backed storage and not table storage.  We may want to do append = 
> upsert, and overwrite = truncate + insert.  I think that may match the normal 
> spark semantics more closely.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks for the information. That would mean both “append” and “overwrite” 
> modes would be combined or not needed in the future.
> 
> Cheers,
> Ben
> 
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> Right now append uses an update Kudu operation, which requires the row 
>> already be present in the table. Overwrite maps to insert.  Kudu very 
>> recently got upsert support baked in, but it hasn't yet been integrated into 
>> the Spark connector.  So pretty soon these sharp edges will get a lot 
>> better, since upsert is the way to go for most spark workloads.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>> data. Now, I have to find answers to these questions. What would happen if I 
>> “append” to the data in the Kudu table if the data already exists? What 
>> would happen if I “overwrite” existing data when the DataFrame has data in 
>> it that does not exist in the Kudu table? I need to evaluate the best way to 
>> simulate the UPSERT behavior in HBase because this is what our use case is.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> Now, I’m getting this error when trying to write to the table.
>>> 
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq(“my_id")
>>> val key_list = List(“my_id”).asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>> 
>>> df.write
>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>> .mode("overwrite")
>>> .kudu
>>> 
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>> (error 0)Not found: key not found (error 0)
>>> 
>>> Does the key field need to be first in the DataFrame?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert >>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>> 
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>>> 
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>> using setRangePartitionColumns or addHashPartitions
>>>> 
>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>> primary key columns, so in this case you have specified the single PK 
>>>> column "my_id".  The `addHashPartitions` call adds

Re: Ask opinion regarding 0.6.0 release package

2016-06-17 Thread Benjamin Kim

Hi,

Our company’s use is spark, phoenix, jdbc/psql. So, if you make different 
packages, I would need the full one. In addition, for the minimized one, would 
there be a way to pick and choose interpreters to add/plug in?

Thanks,
Ben

> On Jun 17, 2016, at 1:02 AM, mina lee  wrote:
> 
> Hi all!
> 
> Zeppelin just started release process. Prior to creating release candidate I 
> want to ask users' opinion about how you want it to be packaged.
> 
> For the last release(0.5.6), we have released one binary package which 
> includes all interpreters.
> The concern with providing one type of binary package is that package size 
> will be quite big(~600MB).
> So I am planning to provide two binary packages:
>   - zeppelin-0.6.0-bin-all.tgz (includes all interpreters)
>   - zeppelin-0.6.0-bin-min.tgz (includes only most used interpreters)
> 
> I am thinking about putting spark(pyspark, sparkr, sql), python, jdbc, shell, 
> markdown, angular in minimized package.
> Could you give your opinion on whether these sets are enough, or some of them 
> are ok to be excluded?
> 
> Community's opinion will be helpful to make decision not only for 0.6.0 but 
> also for 0.7.0 release since we are planning to provide only minimized 
> package from 0.7.0 release. From the 0.7.0 version, interpreters those are 
> not included in binary package will be able to use dynamic interpreter 
> feature [1] which is in progress under [2].
> 
> Thanks,
> Mina
> 
> [1] 
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/manual/dynamicinterpreterload.html
>  
> 
> [2] https://github.com/apache/zeppelin/pull/908 
>

Re: Spark on Kudu

2016-06-15 Thread Benjamin Kim

Since I have created permanent tables using org.apache.spark.sql.jdbc and 
com.databricks.spark.csv with sqlContext, I was wondering if I can do the same 
with Kudu tables?

CREATE TABLE 
USING org.kududb.spark.kudu
OPTIONS ("kudu.master” "kudu_master","kudu.table” "kudu_tablename”)

Is this possible? By the way, the above didn’t work for me.

Thanks,
Ben

> On Jun 14, 2016, at 6:08 PM, Dan Burkert  wrote:
> 
> I'm not sure exactly what the semantics will be, but at least one of them 
> will be upsert.  These modes come from spark, and they were really designed 
> for file-backed storage and not table storage.  We may want to do append = 
> upsert, and overwrite = truncate + insert.  I think that may match the normal 
> spark semantics more closely.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks for the information. That would mean both “append” and “overwrite” 
> modes would be combined or not needed in the future.
> 
> Cheers,
> Ben
> 
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> Right now append uses an update Kudu operation, which requires the row 
>> already be present in the table. Overwrite maps to insert.  Kudu very 
>> recently got upsert support baked in, but it hasn't yet been integrated into 
>> the Spark connector.  So pretty soon these sharp edges will get a lot 
>> better, since upsert is the way to go for most spark workloads.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>> data. Now, I have to find answers to these questions. What would happen if I 
>> “append” to the data in the Kudu table if the data already exists? What 
>> would happen if I “overwrite” existing data when the DataFrame has data in 
>> it that does not exist in the Kudu table? I need to evaluate the best way to 
>> simulate the UPSERT behavior in HBase because this is what our use case is.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> Now, I’m getting this error when trying to write to the table.
>>> 
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq(“my_id")
>>> val key_list = List(“my_id”).asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>> 
>>> df.write
>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>> .mode("overwrite")
>>> .kudu
>>> 
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>> (error 0)Not found: key not found (error 0)
>>> 
>>> Does the key field need to be first in the DataFrame?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert >>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>> 
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>>> 
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>> using setRangePartitionColumns or addHashPartitions
>>>> 
>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>> primary key columns, so in this case you have specified the single PK 
>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to 
>>>> the table, in this case over the column "my_id" (which is good, it must be 
>>>> over one or mor

Re: Performance Question

2016-06-15 Thread Benjamin Kim

Todd,

I think the locality is not within our setup. We have the compute cluster with 
Spark, YARN, etc. on its own, and we have the storage cluster with HBase, Kudu, 
etc. on another. We beefed up the hardware specs on the compute cluster and 
beefed up storage capacity on the storage cluster. We got this setup idea from 
the Databricks folks. I do have a question. I created the table to use range 
partition on columns. I see that if I use hash partition I can set the number 
of splits, but how do I do that using range (50 nodes * 10 = 500 splits)?

Thanks,
Ben

> On Jun 15, 2016, at 9:11 AM, Todd Lipcon  wrote:
> 
> Awesome use case. One thing to keep in mind is that spark parallelism will be 
> limited by the number of tablets. So, you might want to split into 10 or so 
> buckets per node to get the best query throughput.
> 
> Usually if you run top on some machines while running the query you can see 
> if it is fully utilizing the cores.
> 
> Another known issue right now is that spark locality isn't working properly 
> on replicated tables so you will use a lot of network traffic. For a perf 
> test you might want to try a table with replication count 1
> 
> On Jun 15, 2016 5:26 PM, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> I did a simple test of our ad events. We stream using Spark Streaming 
> directly into HBase, and the Data Analysts/Scientists do some 
> insight/discovery work plus some reports generation. For the reports, we use 
> SQL, and the more deeper stuff, we use Spark. In Spark, our main data 
> currency store of choice is DataFrames.
> 
> The schema is around 83 columns wide where most are of the string data type.
> 
> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
> "user_id", "mappable_id",
> "cookie_status", "profile_status", "user_status", "previous_timestamp", 
> "user_agent", "referer",
> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", 
> "creative_id",
> "location_id", “pcamp_id",
> "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip", 
> "isp", "line_speed",
> "gender", "year_of_birth", "behaviors_read", "behaviors_written", 
> "key_value_pairs", "acamp_candidates",
> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip", 
> "pixel_id", “video_id",
> "video_network_id", "video_time_watched", "video_percentage_watched", 
> "video_media_type",
> "video_player_iframed", "video_player_in_view", "video_player_width", 
> "video_player_height",
> "conversion_valid_sale", "conversion_sale_amount", 
> "conversion_commission_amount", "conversion_step",
> "conversion_currency", "conversion_attribution", "conversion_offer_id", 
> "custom_info", "frequency",
> "recency_seconds", "cost", "revenue", “optimizer_acamp_id",
> "optimizer_creative_id", "optimizer_ecpm", "impression_id", "diagnostic_data",
> "user_profile_mapping_source", "latitude", "longitude", "area_code", 
> "gmt_offset", "in_dst",
> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", 
> "timestamp_iso", "reference_id",
> "identity_organization", "identity_method"
> 
> Most queries are like counts of how many users use what browser, how many are 
> unique users, etc. The part that scares most users is when it comes to 
> joining this data with other dimension/3rd party events tables because of 
> shear size of it.
> 
> We do what most companies do, similar to what I saw in earlier presentations 
> of Kudu. We dump data out of HBase into partitioned Parquet tables to make 
> query performance manageable.
> 
> I will coordinate with a data scientist today to do some tests. He is working 
> on identity matching/record linking of users from 2 domains: US and 
> Singapore, using probabilistic deduping algorithms. I will load the data from 
> ad events from both countries, and let him run his process against this data 
> in Kudu. I hope this will “wow” the team.
> 
> Thanks,
> Ben
>

Re: Performance Question

2016-06-15 Thread Benjamin Kim

Hi Todd,

I did a simple test of our ad events. We stream using Spark Streaming directly 
into HBase, and the Data Analysts/Scientists do some insight/discovery work 
plus some reports generation. For the reports, we use SQL, and the more deeper 
stuff, we use Spark. In Spark, our main data currency store of choice is 
DataFrames.

The schema is around 83 columns wide where most are of the string data type.

"event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
"user_id", "mappable_id",
"cookie_status", "profile_status", "user_status", "previous_timestamp", 
"user_agent", "referer",
"host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", 
"creative_id",
"location_id", “pcamp_id",
"pdomain_id", "continent_code", "country", "region", "dma", "city", "zip", 
"isp", "line_speed",
"gender", "year_of_birth", "behaviors_read", "behaviors_written", 
"key_value_pairs", "acamp_candidates",
"tag_format", "optimizer_name", "optimizer_version", "optimizer_ip", 
"pixel_id", “video_id",
"video_network_id", "video_time_watched", "video_percentage_watched", 
"video_media_type",
"video_player_iframed", "video_player_in_view", "video_player_width", 
"video_player_height",
"conversion_valid_sale", "conversion_sale_amount", 
"conversion_commission_amount", "conversion_step",
"conversion_currency", "conversion_attribution", "conversion_offer_id", 
"custom_info", "frequency",
"recency_seconds", "cost", "revenue", “optimizer_acamp_id",
"optimizer_creative_id", "optimizer_ecpm", "impression_id", "diagnostic_data",
"user_profile_mapping_source", "latitude", "longitude", "area_code", 
"gmt_offset", "in_dst",
"proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", 
"timestamp_iso", "reference_id",
"identity_organization", "identity_method"

Most queries are like counts of how many users use what browser, how many are 
unique users, etc. The part that scares most users is when it comes to joining 
this data with other dimension/3rd party events tables because of shear size of 
it.

We do what most companies do, similar to what I saw in earlier presentations of 
Kudu. We dump data out of HBase into partitioned Parquet tables to make query 
performance manageable.

I will coordinate with a data scientist today to do some tests. He is working 
on identity matching/record linking of users from 2 domains: US and Singapore, 
using probabilistic deduping algorithms. I will load the data from ad events 
from both countries, and let him run his process against this data in Kudu. I 
hope this will “wow” the team.

Thanks,
Ben

> On Jun 15, 2016, at 12:47 AM, Todd Lipcon  wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Using spark or something more 
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
> some queries
> 
> Todd
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim"  <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
> only preliminary tests. Do you know of a way to really do some conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published or any 
>> instructions as to measure them myself as good way to test? In addition, 
>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>> where support will be built in?
>> 
>> We don't have a lot of benchmarks published yet, especially on the write 
>> side. I've found that thor

Re: Performance Question

2016-06-14 Thread Benjamin Kim

Hi Todd,

Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
Compared to HBase, read and write performance are better. Write performance has 
the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are only 
preliminary tests. Do you know of a way to really do some conclusive tests? I 
want to see if I can match your results on my 50 node cluster.

Thanks,
Ben

> On May 30, 2016, at 10:33 AM, Todd Lipcon  wrote:
> 
> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> It sounds like Kudu can possibly top or match those numbers put out by 
> Aerospike. Do you have any performance statistics published or any 
> instructions as to measure them myself as good way to test? In addition, this 
> will be a test using Spark, so should I wait for Kudu version 0.9.0 where 
> support will be built in?
> 
> We don't have a lot of benchmarks published yet, especially on the write 
> side. I've found that thorough cross-system benchmarks are very difficult to 
> do fairly and accurately, and often times users end up misguided if they pay 
> too much attention to them :) So, given a finite number of developers working 
> on Kudu, I think we've tended to spend more time on the project itself and 
> less time focusing on "competition". I'm sure there are use cases where Kudu 
> will beat out Aerospike, and probably use cases where Aerospike will beat 
> Kudu as well.
> 
> From my perspective, it would be great if you can share some details of your 
> workload, especially if there are some areas you're finding Kudu lacking. 
> Maybe we can spot some easy code changes we could make to improve 
> performance, or suggest a tuning variable you could change.
> 
> -Todd
> 
> 
>> On May 27, 2016, at 9:19 PM, Todd Lipcon > <mailto:t...@cloudera.com>> wrote:
>> 
>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Hi Mike,
>> 
>> First of all, thanks for the link. It looks like an interesting read. I 
>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>> they are evaluating version 3.5.4. The main thing that impressed me was 
>> their claim that they can beat Cassandra and HBase by 8x for writing and 25x 
>> for reading. Their big claim to fame is that Aerospike can write 1M records 
>> per second with only 50 nodes. I wanted to see if this is real.
>> 
>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>> depending on the size of your records and the insertion order. I've been 
>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>> better.
>>  
>> 
>> To answer your questions, we have a DMP with user profiles with many 
>> attributes. We create segmentation information off of these attributes to 
>> classify them. Then, we can target advertising appropriately for our sales 
>> department. Much of the data processing is for applying models on all or if 
>> not most of every profile’s attributes to find similarities (nearest 
>> neighbor/clustering) over a large number of rows when batch processing or a 
>> small subset of rows for quick online scoring. So, our use case is a typical 
>> advanced analytics scenario. We have tried HBase, but it doesn’t work well 
>> for these types of analytics.
>> 
>> I read, that Aerospike in the release notes, they did do many improvements 
>> for batch and scan operations.
>> 
>> I wonder what your thoughts are for using Kudu for this.
>> 
>> Sounds like a good Kudu use case to me. I've heard great things about 
>> Aerospike for the low latency random access portion, but I've also heard 
>> that it's _very_ expensive, and not particularly suited to the columnar scan 
>> workload. Lastly, I think the Apache license of Kudu is much more appealing 
>> than the AGPL3 used by Aerospike. But, that's not really a direct answer to 
>> the performance question :)
>>  
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On May 27, 2016, at 6:21 PM, Mike Percy >> <mailto:mpe...@cloudera.com>> wrote:
>>> 
>>> Have you considered whether you have a scan heavy or a random access heavy 
>>> workload? Have you considered whether you always access / update a whole 
>>> row vs only a partial row? Kudu is a column store so has some awesome 
>>> performance characteristi

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

Ah, that makes more sense when you put it that way.

Thanks,
Ben

> On Jun 14, 2016, at 6:08 PM, Dan Burkert  wrote:
> 
> I'm not sure exactly what the semantics will be, but at least one of them 
> will be upsert.  These modes come from spark, and they were really designed 
> for file-backed storage and not table storage.  We may want to do append = 
> upsert, and overwrite = truncate + insert.  I think that may match the normal 
> spark semantics more closely.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks for the information. That would mean both “append” and “overwrite” 
> modes would be combined or not needed in the future.
> 
> Cheers,
> Ben
> 
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> Right now append uses an update Kudu operation, which requires the row 
>> already be present in the table. Overwrite maps to insert.  Kudu very 
>> recently got upsert support baked in, but it hasn't yet been integrated into 
>> the Spark connector.  So pretty soon these sharp edges will get a lot 
>> better, since upsert is the way to go for most spark workloads.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>> data. Now, I have to find answers to these questions. What would happen if I 
>> “append” to the data in the Kudu table if the data already exists? What 
>> would happen if I “overwrite” existing data when the DataFrame has data in 
>> it that does not exist in the Kudu table? I need to evaluate the best way to 
>> simulate the UPSERT behavior in HBase because this is what our use case is.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> Now, I’m getting this error when trying to write to the table.
>>> 
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq(“my_id")
>>> val key_list = List(“my_id”).asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>> 
>>> df.write
>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>> .mode("overwrite")
>>> .kudu
>>> 
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>> (error 0)Not found: key not found (error 0)
>>> 
>>> Does the key field need to be first in the DataFrame?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert >>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>> 
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>>> 
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>> using setRangePartitionColumns or addHashPartitions
>>>> 
>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>> primary key columns, so in this case you have specified the single PK 
>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to 
>>>> the table, in this case over the column "my_id" (which is good, it must be 
>>>> over one or more PK columns, so in this case "my_id" is the one and only 
>>>> valid combination).  However, the call to `addHashPartition` also takes 
>>>> the number of buckets as the second param.  You shouldn't get the 
>>>> IllegalArgumentException as long as you are specify

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

Dan,

Thanks for the information. That would mean both “append” and “overwrite” modes 
would be combined or not needed in the future.

Cheers,
Ben

> On Jun 14, 2016, at 5:57 PM, Dan Burkert  wrote:
> 
> Right now append uses an update Kudu operation, which requires the row 
> already be present in the table. Overwrite maps to insert.  Kudu very 
> recently got upsert support baked in, but it hasn't yet been integrated into 
> the Spark connector.  So pretty soon these sharp edges will get a lot better, 
> since upsert is the way to go for most spark workloads.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
> 64s. I would assume that now I can use the “overwrite” mode on existing data. 
> Now, I have to find answers to these questions. What would happen if I 
> “append” to the data in the Kudu table if the data already exists? What would 
> happen if I “overwrite” existing data when the DataFrame has data in it that 
> does not exist in the Kudu table? I need to evaluate the best way to simulate 
> the UPSERT behavior in HBase because this is what our use case is.
> 
> Thanks,
> Ben
> 
> 
> 
>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> Now, I’m getting this error when trying to write to the table.
>> 
>> import scala.collection.JavaConverters._
>> val key_seq = Seq(“my_id")
>> val key_list = List(“my_id”).asJava
>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>> 
>> df.write
>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>> .mode("overwrite")
>> .kudu
>> 
>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>> (error 0)Not found: key not found (error 0)
>> 
>> Does the key field need to be first in the DataFrame?
>> 
>> Thanks,
>> Ben
>> 
>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> 
>>> 
>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Dan,
>>> 
>>> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) 
>>> in the DataFrame and set the partitioning? Is it like this?
>>> 
>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>> 
>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>> using setRangePartitionColumns or addHashPartitions
>>> 
>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
>>> key columns, so in this case you have specified the single PK column 
>>> "my_id".  The `addHashPartitions` call adds hash partitioning to the table, 
>>> in this case over the column "my_id" (which is good, it must be over one or 
>>> more PK columns, so in this case "my_id" is the one and only valid 
>>> combination).  However, the call to `addHashPartition` also takes the 
>>> number of buckets as the second param.  You shouldn't get the 
>>> IllegalArgumentException as long as you are specifying either 
>>> `addHashPartitions` or `setRangePartitionColumns`.
>>> 
>>> - Dan
>>>  
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert >>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> Looks like we're missing an import statement in that example.  Could you 
>>>> try:
>>>> 
>>>> import org.kududb.client._
>>>> and try again?
>>>> 
>>>> - Dan
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> I encountered an error trying to create a table based on the documentation 
>>>> from a DataFrame.
>>>> 
>>>> :49: error: not found: type CreateTableOptions
>>>>   kuduContext.createTable(tableName,

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

I tried to use the “append” mode, and it worked. Over 3.8 million rows in 64s. 
I would assume that now I can use the “overwrite” mode on existing data. Now, I 
have to find answers to these questions. What would happen if I “append” to the 
data in the Kudu table if the data already exists? What would happen if I 
“overwrite” existing data when the DataFrame has data in it that does not exist 
in the Kudu table? I need to evaluate the best way to simulate the UPSERT 
behavior in HBase because this is what our use case is.

Thanks,
Ben


> On Jun 14, 2016, at 5:05 PM, Benjamin Kim  wrote:
> 
> Hi,
> 
> Now, I’m getting this error when trying to write to the table.
> 
> import scala.collection.JavaConverters._
> val key_seq = Seq(“my_id")
> val key_list = List(“my_id”).asJava
> kuduContext.createTable(tableName, df.schema, key_seq, new 
> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
> 
> df.write
> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
> .mode("overwrite")
> .kudu
> 
> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; 
> sample errors: Not found: key not found (error 0)Not found: key not found 
> (error 0)Not found: key not found (error 0)Not found: key not found (error 
> 0)Not found: key not found (error 0)
> 
> Does the key field need to be first in the DataFrame?
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 4:28 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> 
>> 
>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) 
>> in the DataFrame and set the partitioning? Is it like this?
>> 
>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>> 
>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>> using setRangePartitionColumns or addHashPartitions
>> 
>> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
>> key columns, so in this case you have specified the single PK column 
>> "my_id".  The `addHashPartitions` call adds hash partitioning to the table, 
>> in this case over the column "my_id" (which is good, it must be over one or 
>> more PK columns, so in this case "my_id" is the one and only valid 
>> combination).  However, the call to `addHashPartition` also takes the number 
>> of buckets as the second param.  You shouldn't get the 
>> IllegalArgumentException as long as you are specifying either 
>> `addHashPartitions` or `setRangePartitionColumns`.
>> 
>> - Dan
>>  
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert >> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Looks like we're missing an import statement in that example.  Could you 
>>> try:
>>> 
>>> import org.kududb.client._
>>> and try again?
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> I encountered an error trying to create a table based on the documentation 
>>> from a DataFrame.
>>> 
>>> :49: error: not found: type CreateTableOptions
>>>   kuduContext.createTable(tableName, df.schema, Seq("key"), new 
>>> CreateTableOptions().setNumReplicas(1))
>>> 
>>> Is there something I’m missing?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans >>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> It's only in Cloudera's maven repo: 
>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>  
>>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>>>> 
>>>> J-D
>>>> 
>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Hi J-D,
>>>> 
>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>>>> spark-shell to use. Can you show me where to find it?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>&g

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

Hi,

Now, I’m getting this error when trying to write to the table.

import scala.collection.JavaConverters._
val key_seq = Seq(“my_id")
val key_list = List(“my_id”).asJava
kuduContext.createTable(tableName, df.schema, key_seq, new 
CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))

df.write
.options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
.mode("overwrite")
.kudu

java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; 
sample errors: Not found: key not found (error 0)Not found: key not found 
(error 0)Not found: key not found (error 0)Not found: key not found (error 
0)Not found: key not found (error 0)

Does the key field need to be first in the DataFrame?

Thanks,
Ben

> On Jun 14, 2016, at 4:28 PM, Dan Burkert  wrote:
> 
> 
> 
> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) 
> in the DataFrame and set the partitioning? Is it like this?
> 
> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
> 
> java.lang.IllegalArgumentException: Table partitioning must be specified 
> using setRangePartitionColumns or addHashPartitions
> 
> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
> key columns, so in this case you have specified the single PK column "my_id". 
>  The `addHashPartitions` call adds hash partitioning to the table, in this 
> case over the column "my_id" (which is good, it must be over one or more PK 
> columns, so in this case "my_id" is the one and only valid combination).  
> However, the call to `addHashPartition` also takes the number of buckets as 
> the second param.  You shouldn't get the IllegalArgumentException as long as 
> you are specifying either `addHashPartitions` or `setRangePartitionColumns`.
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
> 
>> On Jun 14, 2016, at 4:07 PM, Dan Burkert > <mailto:d...@cloudera.com>> wrote:
>> 
>> Looks like we're missing an import statement in that example.  Could you try:
>> 
>> import org.kududb.client._
>> and try again?
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> I encountered an error trying to create a table based on the documentation 
>> from a DataFrame.
>> 
>> :49: error: not found: type CreateTableOptions
>>   kuduContext.createTable(tableName, df.schema, Seq("key"), new 
>> CreateTableOptions().setNumReplicas(1))
>> 
>> Is there something I’m missing?
>> 
>> Thanks,
>> Ben
>> 
>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans >> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> It's only in Cloudera's maven repo: 
>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>  
>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>>> 
>>> J-D
>>> 
>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi J-D,
>>> 
>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>>> spark-shell to use. Can you show me where to find it?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans >>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> What's in this doc is what's gonna get released: 
>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>  
>>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>>>> 
>>>> J-D
>>>> 
>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Will this be documented with examples once 0.9.0 comes out?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans >>>> <mailto:jdcry...@apache.org>> wrote:
>>>>> 
>>>>> It will be in 0.9.0.
>>>>> 
>>>>> J-D
>>>>> 
>>&

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

Dan,

Thanks! It got further. Now, how do I set the Primary Key to be a column(s) in 
the DataFrame and set the partitioning? Is it like this?

kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))

java.lang.IllegalArgumentException: Table partitioning must be specified using 
setRangePartitionColumns or addHashPartitions

Thanks,
Ben


> On Jun 14, 2016, at 4:07 PM, Dan Burkert  wrote:
> 
> Looks like we're missing an import statement in that example.  Could you try:
> 
> import org.kududb.client._
> and try again?
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I encountered an error trying to create a table based on the documentation 
> from a DataFrame.
> 
> :49: error: not found: type CreateTableOptions
>   kuduContext.createTable(tableName, df.schema, Seq("key"), new 
> CreateTableOptions().setNumReplicas(1))
> 
> Is there something I’m missing?
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans > <mailto:jdcry...@apache.org>> wrote:
>> 
>> It's only in Cloudera's maven repo: 
>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>  
>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>> 
>> J-D
>> 
>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Hi J-D,
>> 
>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>> spark-shell to use. Can you show me where to find it?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans >> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> What's in this doc is what's gonna get released: 
>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>  
>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>>> 
>>> J-D
>>> 
>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Will this be documented with examples once 0.9.0 comes out?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans >>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> It will be in 0.9.0.
>>>> 
>>>> J-D
>>>> 
>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Hi Chris,
>>>> 
>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On May 18, 2016, at 9:01 AM, Chris George >>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>> 
>>>>> There is some code in review that needs some more refinement.
>>>>> It will allow upsert/insert from a dataframe using the datasource api. It 
>>>>> will also allow the creation and deletion of tables from a dataframe
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>>>> 
>>>>> Example usages will look something like:
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>>>> 
>>>>> -Chris George
>>>>> 
>>>>> 
>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" >>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> 
>>>>> Can someone tell me what the state is of this Spark work?
>>>>> 
>>>>> Also, does anyone have any sample code on how to update/insert data in 
>>>>> Kudu using DataFrames?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George >>>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>>> 
>>>>>> SparkSQL cannot support these type of statements but we may be able to 
>>>>>> implement similar functionality through t

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

I encountered an error trying to create a table based on the documentation from 
a DataFrame.

:49: error: not found: type CreateTableOptions
  kuduContext.createTable(tableName, df.schema, Seq("key"), new 
CreateTableOptions().setNumReplicas(1))

Is there something I’m missing?

Thanks,
Ben

> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans  wrote:
> 
> It's only in Cloudera's maven repo: 
> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>  
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
> 
> J-D
> 
> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Hi J-D,
> 
> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
> spark-shell to use. Can you show me where to find it?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans > <mailto:jdcry...@apache.org>> wrote:
>> 
>> What's in this doc is what's gonna get released: 
>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>  
>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>> 
>> J-D
>> 
>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Will this be documented with examples once 0.9.0 comes out?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans >> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> It will be in 0.9.0.
>>> 
>>> J-D
>>> 
>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Chris,
>>> 
>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On May 18, 2016, at 9:01 AM, Chris George >>> <mailto:christopher.geo...@rms.com>> wrote:
>>>> 
>>>> There is some code in review that needs some more refinement.
>>>> It will allow upsert/insert from a dataframe using the datasource api. It 
>>>> will also allow the creation and deletion of tables from a dataframe
>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>>> 
>>>> Example usages will look something like:
>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>>> 
>>>> -Chris George
>>>> 
>>>> 
>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>> Can someone tell me what the state is of this Spark work?
>>>> 
>>>> Also, does anyone have any sample code on how to update/insert data in 
>>>> Kudu using DataFrames?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On Apr 13, 2016, at 8:22 AM, Chris George >>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>> 
>>>>> SparkSQL cannot support these type of statements but we may be able to 
>>>>> implement similar functionality through the api.
>>>>> -Chris
>>>>> 
>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" >>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> 
>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
>>>>> were to be implemented.
>>>>> 
>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>  WHEN MATCHED THEN
>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>  WHEN NOT MATCHED THEN
>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>> 
>>>>> Cheers,
>>>>> Ben
>>>>> 
>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George >>>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>>> 
>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into 
>>>>>> gerrit if you want to take a look. 
>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>> <http

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

Thank you.

> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans  wrote:
> 
> It's only in Cloudera's maven repo: 
> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>  
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
> 
> J-D
> 
> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Hi J-D,
> 
> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
> spark-shell to use. Can you show me where to find it?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans > <mailto:jdcry...@apache.org>> wrote:
>> 
>> What's in this doc is what's gonna get released: 
>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>  
>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>> 
>> J-D
>> 
>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Will this be documented with examples once 0.9.0 comes out?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans >> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> It will be in 0.9.0.
>>> 
>>> J-D
>>> 
>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim >> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Chris,
>>> 
>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On May 18, 2016, at 9:01 AM, Chris George >>> <mailto:christopher.geo...@rms.com>> wrote:
>>>> 
>>>> There is some code in review that needs some more refinement.
>>>> It will allow upsert/insert from a dataframe using the datasource api. It 
>>>> will also allow the creation and deletion of tables from a dataframe
>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>>> 
>>>> Example usages will look something like:
>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>>> 
>>>> -Chris George
>>>> 
>>>> 
>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>> Can someone tell me what the state is of this Spark work?
>>>> 
>>>> Also, does anyone have any sample code on how to update/insert data in 
>>>> Kudu using DataFrames?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On Apr 13, 2016, at 8:22 AM, Chris George >>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>> 
>>>>> SparkSQL cannot support these type of statements but we may be able to 
>>>>> implement similar functionality through the api.
>>>>> -Chris
>>>>> 
>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" >>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> 
>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
>>>>> were to be implemented.
>>>>> 
>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>  WHEN MATCHED THEN
>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>  WHEN NOT MATCHED THEN
>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>> 
>>>>> Cheers,
>>>>> Ben
>>>>> 
>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George >>>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>>> 
>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into 
>>>>>> gerrit if you want to take a look. 
>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>>>>> It does pushdown predicates which the existing input formatter based rdd 
>>>>>> does not.
>>>>>> 
>>>>>> Within the next two weeks I’m planning to implement a datasource for 
>&g

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim

Hi J-D,

I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
spark-shell to use. Can you show me where to find it?

Thanks,
Ben


> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans  wrote:
> 
> What's in this doc is what's gonna get released: 
> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>  
> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
> 
> J-D
> 
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Will this be documented with examples once 0.9.0 comes out?
> 
> Thanks,
> Ben
> 
> 
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans > <mailto:jdcry...@apache.org>> wrote:
>> 
>> It will be in 0.9.0.
>> 
>> J-D
>> 
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>> Hi Chris,
>> 
>> Will all this effort be rolled into 0.9.0 and be ready for use?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On May 18, 2016, at 9:01 AM, Chris George >> <mailto:christopher.geo...@rms.com>> wrote:
>>> 
>>> There is some code in review that needs some more refinement.
>>> It will allow upsert/insert from a dataframe using the datasource api. It 
>>> will also allow the creation and deletion of tables from a dataframe
>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>> 
>>> Example usages will look something like:
>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>> 
>>> -Chris George
>>> 
>>> 
>>> On 5/18/16, 9:45 AM, "Benjamin Kim" >> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Can someone tell me what the state is of this Spark work?
>>> 
>>> Also, does anyone have any sample code on how to update/insert data in Kudu 
>>> using DataFrames?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Apr 13, 2016, at 8:22 AM, Chris George >>> <mailto:christopher.geo...@rms.com>> wrote:
>>>> 
>>>> SparkSQL cannot support these type of statements but we may be able to 
>>>> implement similar functionality through the api.
>>>> -Chris
>>>> 
>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
>>>> were to be implemented.
>>>> 
>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>  WHEN MATCHED THEN
>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>  WHEN NOT MATCHED THEN
>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>>> On Apr 11, 2016, at 12:21 PM, Chris George >>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>> 
>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into 
>>>>> gerrit if you want to take a look. 
>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>>>> It does pushdown predicates which the existing input formatter based rdd 
>>>>> does not.
>>>>> 
>>>>> Within the next two weeks I’m planning to implement a datasource for 
>>>>> spark that will have pushdown predicates and insertion/update 
>>>>> functionality (need to look more at cassandra and the hbase datasource 
>>>>> for best way to do this) I agree that server side upsert would be helpful.
>>>>> Having a datasource would give us useful data frames and also make spark 
>>>>> sql usable for kudu.
>>>>> 
>>>>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>>>>> have had trouble getting impala to run fast with high concurrency when 
>>>>> compared to spark 2. We interact with datasources which do not integrate 
>>>>> with impala. 3. We have custom sql query planners for extended sql 
>>>>> functionality.
>>>>> 
>>>>> -Chris George
>>>>> 
>>>

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-14 Thread Benjamin Kim

Hi J-D,

I would like to get this started especially now that UPSERT and Spark SQL 
DataFrames support. But, how do I use Cloudera Manager to deploy it? Is there a 
parcel available yet? Is there a new CSD file to be downloaded? I currently 
have CM 5.7.0 installed.

Thanks,
Ben



> On Jun 10, 2016, at 7:39 AM, Jean-Daniel Cryans  wrote:
> 
> The Apache Kudu (incubating) team is happy to announce the release of Kudu 
> 0.9.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds basic UPSERT functionality and an improved Apache 
> Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also 
> improves Tablet Server restart time as well as write performance under high 
> load. Finally, Kudu now enforces the specification of a partitioning scheme 
> for new tables.
> 
> Download it here: http://getkudu.io/releases/0.9.0/ 
> 
> 
> Regards,
> 
> The Apache Kudu (incubating) team
> 
> ===
> 
> Apache Kudu (incubating) is an effort undergoing incubation at The Apache 
> Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making
> process have stabilized in a manner consistent with other successful
> ASF projects. While incubation status is not necessarily a reflection
> of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-13 Thread Benjamin Kim

Hi J-D,

I would like to get this started especially now that UPSERT and Spark SQL 
DataFrames support. But, how do I use Cloudera Manager to deploy it? Is there a 
parcel available yet? Is there a new CSD file to be downloaded? I currently 
have CM 5.7.0 installed.

Thanks,
Ben



> On Jun 10, 2016, at 7:39 AM, Jean-Daniel Cryans  wrote:
> 
> The Apache Kudu (incubating) team is happy to announce the release of Kudu 
> 0.9.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds basic UPSERT functionality and an improved Apache 
> Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also 
> improves Tablet Server restart time as well as write performance under high 
> load. Finally, Kudu now enforces the specification of a partitioning scheme 
> for new tables.
> 
> Download it here: http://getkudu.io/releases/0.9.0/ 
> 
> 
> Regards,
> 
> The Apache Kudu (incubating) team
> 
> ===
> 
> Apache Kudu (incubating) is an effort undergoing incubation at The Apache 
> Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making
> process have stabilized in a manner consistent with other successful
> ASF projects. While incubation status is not necessarily a reflection
> of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-13 Thread Benjamin Kim

Hi J-D,

I would like to get this started especially now that UPSERT and Spark SQL 
DataFrames support. But, how do I use Cloudera Manager to deploy it? Is there a 
parcel available yet? Is there a new CSD file to be downloaded? I currently 
have CM 5.7.0 installed.

Thanks,
Ben



> On Jun 10, 2016, at 7:39 AM, Jean-Daniel Cryans  wrote:
> 
> The Apache Kudu (incubating) team is happy to announce the release of Kudu 
> 0.9.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds basic UPSERT functionality and an improved Apache 
> Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also 
> improves Tablet Server restart time as well as write performance under high 
> load. Finally, Kudu now enforces the specification of a partitioning scheme 
> for new tables.
> 
> Download it here: http://getkudu.io/releases/0.9.0/ 
> 
> 
> Regards,
> 
> The Apache Kudu (incubating) team
> 
> ===
> 
> Apache Kudu (incubating) is an effort undergoing incubation at The Apache 
> Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making
> process have stabilized in a manner consistent with other successful
> ASF projects. While incubation status is not necessarily a reflection
> of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.

Re: phoenix on non-apache hbase

2016-06-12 Thread Benjamin Kim

I saw that Sean Busbey has commented that Phoenix 4.7.1 is about to be released 
for CDH 5.7.0. I am hoping for the good news soon!

> On Jun 9, 2016, at 10:09 PM, Ankur Jain  wrote:
> 
> I have updated my jira with updated instructions 
> https://issues.apache.org/jira/browse/PHOENIX-2834 
> <https://issues.apache.org/jira/browse/PHOENIX-2834>.
> 
> Please do let me know if you are able to build and use with CDH5.7
> 
> Thanks,
> Ankur Jain
> 
> From: Andrew Purtell  <mailto:andrew.purt...@gmail.com>>
> Reply-To: "user@phoenix.apache.org <mailto:user@phoenix.apache.org>" 
> mailto:user@phoenix.apache.org>>
> Date: Friday, 10 June 2016 at 9:06 AM
> To: "user@phoenix.apache.org <mailto:user@phoenix.apache.org>" 
> mailto:user@phoenix.apache.org>>
> Subject: Re: phoenix on non-apache hbase
> 
> Yes a stock client should work with a server modified for CDH assuming both 
> client and server versions are within the bounds specified by the backwards 
> compatibility policy (https://phoenix.apache.org/upgrading.html 
> <https://phoenix.apache.org/upgrading.html>)
> 
> "Phoenix maintains backward compatibility across at least two minor releases 
> to allow for no downtime through server-side rolling restarts upon upgrading."
> 
> 
> On Jun 9, 2016, at 8:09 PM, Koert Kuipers  <mailto:ko...@tresata.com>> wrote:
> 
>> is phoenix client also affect by this? or does phoenix server isolate the 
>> client? 
>> 
>> is it reasonable to expect a "stock" phoenix client to work against a custom 
>> phoenix server for cdh 5.x? (with of course the phoenix client and server 
>> having same phoenix version).
>> 
>> 
>> 
>> On Thu, Jun 9, 2016 at 10:55 PM, Benjamin Kim > <mailto:bbuil...@gmail.com>> wrote:
>>> Andrew,
>>> 
>>> Since we are still on CDH 5.5.2, can I just use your custom version? 
>>> Phoenix is one of the reasons that we are blocked from upgrading to CDH 
>>> 5.7.1. Thus, CDH 5.7.1 is only on our test cluster. One of our developers 
>>> wants to try out the Phoenix Spark plugin. Did you try it out in yours too? 
>>> Does it work if you did?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 9, 2016, at 7:47 PM, Andrew Purtell >>> <mailto:andrew.purt...@gmail.com>> wrote:
>>>> 
>>>> >  is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache HBase 
>>>> > 1.2.0?
>>>> 
>>>> Yes
>>>> 
>>>> As is the Cloudera HBase in 5.6, 5.5, 5.4, ... quite different from Apache 
>>>> HBase in coprocessor and RPC internal extension APIs. 
>>>> 
>>>> We have made some ports of Apache Phoenix releases to CDH here: 
>>>> https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5
>>>>  
>>>> <https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5?files=1>
>>>>  
>>>> 
>>>> It's a personal project of mine, not something supported by the community. 
>>>> Sounds like I should look at what to do with CDH 5.6 and 5.7. 
>>>> 
>>>> On Jun 9, 2016, at 7:37 PM, Benjamin Kim >>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>>> This interests me too. I asked Cloudera in their community forums a while 
>>>>> back but got no answer on this. I hope they don’t leave us out in the 
>>>>> cold. I tried building it too before with the instructions here 
>>>>> https://issues.apache.org/jira/browse/PHOENIX-2834 
>>>>> <https://issues.apache.org/jira/browse/PHOENIX-2834>. I could get it to 
>>>>> build, but I couldn’t get it to work using the Phoenix installation 
>>>>> instructions. For some reason, dropping the server jar into CDH 5.7.0 
>>>>> HBase lib directory didn’t change things. HBase seemed not to use it. Now 
>>>>> that this is out, I’ll give it another try hoping that there is a way. If 
>>>>> anyone has any leads to help, please let me know.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Jun 9, 2016, at 6:39 PM, Josh Elser >>>>> <mailto:josh.el...@gmail.com>> wrote:
>>>>>> 
>>>>>> Koert,
>>>>>> 
>>>>>> Apache Phoenix goes through a lot of work to provide multiple versions 
>>&

Re: phoenix on non-apache hbase

2016-06-09 Thread Benjamin Kim

Andrew,

Since we are still on CDH 5.5.2, can I just use your custom version? Phoenix is 
one of the reasons that we are blocked from upgrading to CDH 5.7.1. Thus, CDH 
5.7.1 is only on our test cluster. One of our developers wants to try out the 
Phoenix Spark plugin. Did you try it out in yours too? Does it work if you did?

Thanks,
Ben


> On Jun 9, 2016, at 7:47 PM, Andrew Purtell  wrote:
> 
> >  is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache HBase 1.2.0?
> 
> Yes
> 
> As is the Cloudera HBase in 5.6, 5.5, 5.4, ... quite different from Apache 
> HBase in coprocessor and RPC internal extension APIs. 
> 
> We have made some ports of Apache Phoenix releases to CDH here: 
> https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5
>  
> <https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5?files=1>
>  
> 
> It's a personal project of mine, not something supported by the community. 
> Sounds like I should look at what to do with CDH 5.6 and 5.7. 
> 
> On Jun 9, 2016, at 7:37 PM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> 
>> This interests me too. I asked Cloudera in their community forums a while 
>> back but got no answer on this. I hope they don’t leave us out in the cold. 
>> I tried building it too before with the instructions here 
>> https://issues.apache.org/jira/browse/PHOENIX-2834 
>> <https://issues.apache.org/jira/browse/PHOENIX-2834>. I could get it to 
>> build, but I couldn’t get it to work using the Phoenix installation 
>> instructions. For some reason, dropping the server jar into CDH 5.7.0 HBase 
>> lib directory didn’t change things. HBase seemed not to use it. Now that 
>> this is out, I’ll give it another try hoping that there is a way. If anyone 
>> has any leads to help, please let me know.
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 9, 2016, at 6:39 PM, Josh Elser >> <mailto:josh.el...@gmail.com>> wrote:
>>> 
>>> Koert,
>>> 
>>> Apache Phoenix goes through a lot of work to provide multiple versions of 
>>> Phoenix for various versions of Apache HBase (0.98, 1.1, and 1.2 
>>> presently). The builds for each of these branches are tested against those 
>>> specific versions of HBase, so I doubt that there are issues between Apache 
>>> Phoenix and the corresponding version of Apache HBase.
>>> 
>>> In general, I believe older versions of Phoenix clients can work against 
>>> newer versions of Phoenix running in HBase; but, of course, you'd be much 
>>> better off using equivalent versions on both client and server.
>>> 
>>> If you are having issues running Apache Phoenix over vendor-creations of 
>>> HBase, I would encourage you to reach out on said-vendor's support channels.
>>> 
>>> - Josh
>>> 
>>> Koert Kuipers wrote:
>>>> hello all,
>>>> 
>>>> i decided i wanted to give phoenix a try on our cdh 5.7.0 cluster. so i
>>>> download phoenix, see that the master is already for hbase 1.2.0, change
>>>> the hbase version to 1.2.0-cdh5.7.0, and tell maven to run tests make
>>>> the package, expecting not much trouble.
>>>> 
>>>> but i was wrong... plenty of compilation errors, and some serious
>>>> incompatibilities (tetra?).
>>>> 
>>>> yikes. what happened? why is it so hard to compile for a distro's hbase?
>>>> i do this all the time for vendor-specific hadoop versions without
>>>> issues. is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache
>>>> hbase 1.2.0?
>>>> 
>>>> assuming i get the phoenix-server working for hbase 1.2.0-cdh5.7.0, how
>>>> sensitive is the phoenix-client to the hbase version? can i at least
>>>> assume all the pain is in the phoenix-server and i can ship a generic
>>>> phoenix-client with my software that works on all clusters with the same
>>>> phoenix-server version installed?
>>>> 
>>>> thanks! best, koert
>>

Re: phoenix on non-apache hbase

2016-06-09 Thread Benjamin Kim

This interests me too. I asked Cloudera in their community forums a while back 
but got no answer on this. I hope they don’t leave us out in the cold. I tried 
building it too before with the instructions here 
https://issues.apache.org/jira/browse/PHOENIX-2834. I could get it to build, 
but I couldn’t get it to work using the Phoenix installation instructions. For 
some reason, dropping the server jar into CDH 5.7.0 HBase lib directory didn’t 
change things. HBase seemed not to use it. Now that this is out, I’ll give it 
another try hoping that there is a way. If anyone has any leads to help, please 
let me know.

Thanks,
Ben


> On Jun 9, 2016, at 6:39 PM, Josh Elser  wrote:
> 
> Koert,
> 
> Apache Phoenix goes through a lot of work to provide multiple versions of 
> Phoenix for various versions of Apache HBase (0.98, 1.1, and 1.2 presently). 
> The builds for each of these branches are tested against those specific 
> versions of HBase, so I doubt that there are issues between Apache Phoenix 
> and the corresponding version of Apache HBase.
> 
> In general, I believe older versions of Phoenix clients can work against 
> newer versions of Phoenix running in HBase; but, of course, you'd be much 
> better off using equivalent versions on both client and server.
> 
> If you are having issues running Apache Phoenix over vendor-creations of 
> HBase, I would encourage you to reach out on said-vendor's support channels.
> 
> - Josh
> 
> Koert Kuipers wrote:
>> hello all,
>> 
>> i decided i wanted to give phoenix a try on our cdh 5.7.0 cluster. so i
>> download phoenix, see that the master is already for hbase 1.2.0, change
>> the hbase version to 1.2.0-cdh5.7.0, and tell maven to run tests make
>> the package, expecting not much trouble.
>> 
>> but i was wrong... plenty of compilation errors, and some serious
>> incompatibilities (tetra?).
>> 
>> yikes. what happened? why is it so hard to compile for a distro's hbase?
>> i do this all the time for vendor-specific hadoop versions without
>> issues. is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache
>> hbase 1.2.0?
>> 
>> assuming i get the phoenix-server working for hbase 1.2.0-cdh5.7.0, how
>> sensitive is the phoenix-client to the hbase version? can i at least
>> assume all the pain is in the phoenix-server and i can ship a generic
>> phoenix-client with my software that works on all clusters with the same
>> phoenix-server version installed?
>> 
>> thanks! best, koert

Github Integration

2016-06-09 Thread Benjamin Kim

I heard that Zeppelin 0.6.0 is able to use its local notebook directory as a 
Github repo. Does anyone know of a way to have it work (workaround) with our 
company’s Github (Stash) repo server?

Any advice would be welcome.

Thanks,
Ben

Re: Spark on Kudu

2016-06-07 Thread Benjamin Kim

Will this be documented with examples once 0.9.0 comes out?

Thanks,
Ben

> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans  wrote:
> 
> It will be in 0.9.0.
> 
> J-D
> 
> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Hi Chris,
> 
> Will all this effort be rolled into 0.9.0 and be ready for use?
> 
> Thanks,
> Ben
> 
> 
>> On May 18, 2016, at 9:01 AM, Chris George > <mailto:christopher.geo...@rms.com>> wrote:
>> 
>> There is some code in review that needs some more refinement.
>> It will allow upsert/insert from a dataframe using the datasource api. It 
>> will also allow the creation and deletion of tables from a dataframe
>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>> 
>> Example usages will look something like:
>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>> 
>> -Chris George
>> 
>> 
>> On 5/18/16, 9:45 AM, "Benjamin Kim" > <mailto:bbuil...@gmail.com>> wrote:
>> 
>> Can someone tell me what the state is of this Spark work?
>> 
>> Also, does anyone have any sample code on how to update/insert data in Kudu 
>> using DataFrames?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Apr 13, 2016, at 8:22 AM, Chris George >> <mailto:christopher.geo...@rms.com>> wrote:
>>> 
>>> SparkSQL cannot support these type of statements but we may be able to 
>>> implement similar functionality through the api.
>>> -Chris
>>> 
>>> On 4/12/16, 5:19 PM, "Benjamin Kim" >> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
>>> were to be implemented.
>>> 
>>> MERGE INTO table_name USING table_reference ON (condition)
>>>  WHEN MATCHED THEN
>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>  WHEN NOT MATCHED THEN
>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>> 
>>> Cheers,
>>> Ben
>>> 
>>>> On Apr 11, 2016, at 12:21 PM, Chris George >>> <mailto:christopher.geo...@rms.com>> wrote:
>>>> 
>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit 
>>>> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>>> It does pushdown predicates which the existing input formatter based rdd 
>>>> does not.
>>>> 
>>>> Within the next two weeks I’m planning to implement a datasource for spark 
>>>> that will have pushdown predicates and insertion/update functionality 
>>>> (need to look more at cassandra and the hbase datasource for best way to 
>>>> do this) I agree that server side upsert would be helpful.
>>>> Having a datasource would give us useful data frames and also make spark 
>>>> sql usable for kudu.
>>>> 
>>>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>>>> have had trouble getting impala to run fast with high concurrency when 
>>>> compared to spark 2. We interact with datasources which do not integrate 
>>>> with impala. 3. We have custom sql query planners for extended sql 
>>>> functionality.
>>>> 
>>>> -Chris George
>>>> 
>>>> 
>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" >>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> You guys make a convincing point, although on the upsert side we'll need 
>>>> more support from the servers. Right now all you can do is an INSERT then, 
>>>> if you get a dup key, do an UPDATE. I guess we could at least add an API 
>>>> on the client side that would manage it, but it wouldn't be atomic.
>>>> 
>>>> J-D
>>>> 
>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra >>> <mailto:m...@clearstorydata.com>> wrote:
>>>> It's pretty simple, actually.  I need to support versioned datasets in a 
>>>> Spark SQL environment.  Instead of a hack on top of a Parquet data store, 
>>>> I'm hoping (among other reasons) to be able to use Kudu's write and 
>>>> timestamp-based read operations to support not only append

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim

ENT '',
>`host_domain` string COMMENT '',
>`browser_type` string COMMENT '',
>`browser_device_cat` string COMMENT '',
>`browser_family` string COMMENT '',
>`browser_name` string COMMENT '',
>`browser_version` string COMMENT '',
>`browser_major_version` string COMMENT '',
>`browser_minor_version` string COMMENT '',
>`os_family` string COMMENT '',
>`os_name` string COMMENT '',
>`os_version` string COMMENT '',
>`os_major_version` string COMMENT '',
>`os_minor_version` string COMMENT '')
>  PARTITIONED BY (`dt` timestamp)
>  STORED AS PARQUET;
> desc formatted amo_bi_events;
> 
> The output in hive is as follows:
> 
> # col_name  data_type   comment
> event_type  string
> timestamp   string
> event_valid int
> event_subtype   string
> user_ip string
> user_id string
> cookie_status   string
> profile_status  string
> user_status string
> previous_timestamp  string
> user_agent  string
> referer string
> uri string
> request_elapsed bigint
> browser_languages   string
> acamp_idint
> creative_id int
> location_id int
> pcamp_idint
> pdomain_id  int
> country string
> region  string
> dma int
> citystring
> zip string
> isp string
> line_speed  string
> gender  string
> year_of_birth   int
> behaviors_read  string
> behaviors_written   string
> key_value_pairs string
> acamp_candidatesint
> tag_format  string
> optimizer_name  string
> optimizer_version   string
> optimizer_ipstring
> pixel_idint
> video_idstring
> video_network_idint
> video_time_watched  bigint
> video_percentage_watchedint
> conversion_valid_sale   int
> conversion_sale_amount  float
> conversion_commission_amountfloat
> conversion_step int
> conversion_currency string
> conversion_attribution  int
> conversion_offer_id string
> custom_info string
> frequency   int
> recency_seconds int
> costfloat
> revenue float
> optimizer_acamp_id  int
> optimizer_creative_id   int
> optimizer_ecpm  float
> event_idstring
> impression_id   string
> diagnostic_data string
> user_profile_mapping_source string
> latitudefloat
> longitude   float
> area_code   int
> gmt_offset  string
> in_dst  string
> proxy_type  string
> mobile_carrier  string
> pop string
> hostnamestring
> profile_ttl string
> timestamp_iso   string
> reference_idstring
> identity_organization   string
> identity_method string
> mappable_id string
> profile_expires string
> video_player_iframedint
> video_player_in_viewint
> video_player_width  int
> video_player_height int
> host_domain string
> browser_typestring
> browser_device_cat  string
> browser_family  string
> browser_namestring
> browser_version string
> browser_major_version   string
> browser_minor_version   string
> os_family   string
> os_name string
> os_version  string
> os_major_versionstring
> os_minor_versionstring
> # Partition Information
> # col_name  data_type   comment
> dt  timestamp
> # Detailed Table Information
> Database:   test
> Owner:  hduser
> CreateTime: Fri Jun 03 19:03:20 BST 2016
> LastAccessTime: UNKNOWN
> Retention:  0
> Location:   
> hdfs://rhes564:9000/user/hive/warehouse/test.db/amo_bi_events
> Table Type: EXTERNAL_TABLE
> Table Parameters:
> EXTERNALTRUE
> transient_lastDdlTime   1464977000
> # Storage Information
> SerDe Library:  
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim

Mich,

I am using .withColumn to add another column “dt” that is a reformatted version 
of an existing column “timestamp”. The partitioned by column is “dt”.

We are using Spark 1.6.0 in CDH 5.7.0.

Thanks,
Ben

> On Jun 3, 2016, at 10:33 AM, Mich Talebzadeh  
> wrote:
> 
> what version of spark are you using
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 June 2016 at 17:51, Mich Talebzadeh  <mailto:mich.talebza...@gmail.com>> wrote:
> ok what is the new column is called? you are basically adding a new column to 
> an already existing table
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 June 2016 at 17:04, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> The table already exists.
> 
>  CREATE EXTERNAL TABLE `amo_bi_events`(   
>`event_type` string COMMENT '',
>   
>`timestamp` string COMMENT '', 
>   
>`event_valid` int COMMENT '',  
>   
>`event_subtype` string COMMENT '', 
>   
>`user_ip` string COMMENT '',   
>   
>`user_id` string COMMENT '',   
>   
>`cookie_status` string COMMENT '', 
>   
>`profile_status` string COMMENT '',
>   
>`user_status` string COMMENT '',   
>   
>`previous_timestamp` string COMMENT '',
>   
>`user_agent` string COMMENT '',
>   
>`referer` string COMMENT '',   
>   
>`uri` string COMMENT '',   
>   
>`request_elapsed` bigint COMMENT '',   
>   
>`browser_languages` string COMMENT '', 
>   
>`acamp_id` int COMMENT '', 
>   
>`creative_id` int COMMENT '',  
>   
>`location_id` int COMMENT '',  
>   
>`pcamp_id` int COMMENT '', 
>   
>`pdomain_id` int COMMENT '',   
>   
>`country` string COMMENT '',   
>   
>`region` string COMMENT '',
>   
>`dma` int COMMENT '',  
>   
>`city` string COMMENT '',  
>   
>`zip` string COMMENT '',   
>   
>`isp` string COMMENT '',   
>   
>`line_speed` string COMMENT '',
>   
>`gender` string COMMENT '',
>   
>`year_of_birth` int COMMENT '',
>   
>`behaviors_read` string COMMENT '',
>   
>`behaviors_written` string COMMENT '', 
>   
>`key_value_pairs` string COMMENT '',   
>   
>`acamp_candidates` int COMMENT '', 
>

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim

 

   `recency_seconds` int COMMENT '',

   `cost` float COMMENT '', 

   `revenue` float COMMENT '',  

   `optimizer_acamp_id` int COMMENT '', 

   `optimizer_creative_id` int COMMENT '',  

   `optimizer_ecpm` float COMMENT '',   

   `event_id` string COMMENT '',

   `impression_id` string COMMENT '',   

   `diagnostic_data` string COMMENT '', 

   `user_profile_mapping_source` string COMMENT '', 

   `latitude` float COMMENT '', 

   `longitude` float COMMENT '',

   `area_code` int COMMENT '',  

   `gmt_offset` string COMMENT '',  

   `in_dst` string COMMENT '',  

   `proxy_type` string COMMENT '',  

   `mobile_carrier` string COMMENT '',  

   `pop` string COMMENT '', 

   `hostname` string COMMENT '',

   `profile_ttl` string COMMENT '', 

   `timestamp_iso` string COMMENT '',   

   `reference_id` string COMMENT '',

   `identity_organization` string COMMENT '',   

   `identity_method` string COMMENT '', 

   `mappable_id` string COMMENT '', 

   `profile_expires` string COMMENT '', 

   `video_player_iframed` int COMMENT '',   

   `video_player_in_view` int COMMENT '',   

   `video_player_width` int COMMENT '', 

   `video_player_height` int COMMENT '',

   `host_domain` string COMMENT '', 

   `browser_type` string COMMENT '',

   `browser_device_cat` string COMMENT '',  

   `browser_family` string COMMENT '',  

   `browser_name` string COMMENT '',

   `browser_version` string COMMENT '', 

   `browser_major_version` string COMMENT '',   

   `browser_minor_version` string COMMENT '',   

   `os_family` string COMMENT '',   

   `os_name` string COMMENT '', 

   `os_version` string COMMENT '',  

   `os_major_version` string COMMENT '',

   `os_minor_version` string COMMENT '')

 PARTITIONED BY (`dt` timestamp)
 
 STORED AS PARQUET;

Thanks,
Ben


> On Jun 3, 2016, at 8:47 AM, Mich Talebzadeh  wrote:
> 
> hang on are you saving this as a new table?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 June 2016 at 14:13, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Does anyone know how to save data in a DataFrame to a table partitioned using 
> an existing column reformatted into a derived column?
> 
> val p

Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim

Does anyone know how to save data in a DataFrame to a table partitioned using 
an existing column reformatted into a derived column?

val partitionedDf = df.withColumn("dt", 
concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", 12, 
2), lit(":00")))

sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", 
"nonstrict")
partitionedDf.write
.mode(SaveMode.Append)
.partitionBy("dt")
.saveAsTable("ds.amo_bi_events")

I am getting an ArrayOutOfBounds error. There are 83 columns in the destination 
table. But after adding the derived column, then I get an 84 error. I assumed 
that the column used for the partition would not be counted.

Can someone please help.

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark on Kudu

2016-05-28 Thread Benjamin Kim

JD,

That’s awesome! I can’t wait to start working with it.

Thanks,
Ben


> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans  wrote:
> 
> It will be in 0.9.0.
> 
> J-D
> 
> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> Hi Chris,
> 
> Will all this effort be rolled into 0.9.0 and be ready for use?
> 
> Thanks,
> Ben
> 
> 
>> On May 18, 2016, at 9:01 AM, Chris George > <mailto:christopher.geo...@rms.com>> wrote:
>> 
>> There is some code in review that needs some more refinement.
>> It will allow upsert/insert from a dataframe using the datasource api. It 
>> will also allow the creation and deletion of tables from a dataframe
>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>> 
>> Example usages will look something like:
>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>> 
>> -Chris George
>> 
>> 
>> On 5/18/16, 9:45 AM, "Benjamin Kim" > <mailto:bbuil...@gmail.com>> wrote:
>> 
>> Can someone tell me what the state is of this Spark work?
>> 
>> Also, does anyone have any sample code on how to update/insert data in Kudu 
>> using DataFrames?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Apr 13, 2016, at 8:22 AM, Chris George >> <mailto:christopher.geo...@rms.com>> wrote:
>>> 
>>> SparkSQL cannot support these type of statements but we may be able to 
>>> implement similar functionality through the api.
>>> -Chris
>>> 
>>> On 4/12/16, 5:19 PM, "Benjamin Kim" >> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
>>> were to be implemented.
>>> 
>>> MERGE INTO table_name USING table_reference ON (condition)
>>>  WHEN MATCHED THEN
>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>  WHEN NOT MATCHED THEN
>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>> 
>>> Cheers,
>>> Ben
>>> 
>>>> On Apr 11, 2016, at 12:21 PM, Chris George >>> <mailto:christopher.geo...@rms.com>> wrote:
>>>> 
>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit 
>>>> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>>> It does pushdown predicates which the existing input formatter based rdd 
>>>> does not.
>>>> 
>>>> Within the next two weeks I’m planning to implement a datasource for spark 
>>>> that will have pushdown predicates and insertion/update functionality 
>>>> (need to look more at cassandra and the hbase datasource for best way to 
>>>> do this) I agree that server side upsert would be helpful.
>>>> Having a datasource would give us useful data frames and also make spark 
>>>> sql usable for kudu.
>>>> 
>>>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>>>> have had trouble getting impala to run fast with high concurrency when 
>>>> compared to spark 2. We interact with datasources which do not integrate 
>>>> with impala. 3. We have custom sql query planners for extended sql 
>>>> functionality.
>>>> 
>>>> -Chris George
>>>> 
>>>> 
>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" >>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> You guys make a convincing point, although on the upsert side we'll need 
>>>> more support from the servers. Right now all you can do is an INSERT then, 
>>>> if you get a dup key, do an UPDATE. I guess we could at least add an API 
>>>> on the client side that would manage it, but it wouldn't be atomic.
>>>> 
>>>> J-D
>>>> 
>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra >>> <mailto:m...@clearstorydata.com>> wrote:
>>>> It's pretty simple, actually.  I need to support versioned datasets in a 
>>>> Spark SQL environment.  Instead of a hack on top of a Parquet data store, 
>>>> I'm hoping (among other reasons) to be able to use Kudu's write and 
>>>> timestamp-based read operations to support not only append

< 1 2 3 4 >

101 - 200 of 361 matches

Mail list logo