Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
, please let me know. Thanks, Ben > On Apr 9, 2016, at 2:49 PM, Benjamin Kim wrote: > > This was easy! > > I just created a notification on a source S3 bucket to kick off a Lambda > function that would decompress the dropped file and save it to another S3 > bucket. In return,
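
For reference, a minimal sketch of such a Lambda handler in Scala (the thread does not include the code itself; the bucket name, the gzip format, and the aws-lambda-java/AWS SDK classes used here are assumptions for illustration):

import java.util.zip.GZIPInputStream
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

// Hypothetical handler: decompress a gzipped object dropped into the
// source bucket and write the result to a destination bucket.
class DecompressHandler extends RequestHandler[S3Event, Unit] {
  private val s3 = new AmazonS3Client()
  private val destBucket = "example-decompressed-bucket" // illustrative

  override def handleRequest(event: S3Event, context: Context): Unit = {
    val record = event.getRecords.get(0)
    val srcBucket = record.getS3.getBucket.getName
    val key = record.getS3.getObject.getKey

    // Stream the gzipped object through GZIPInputStream into the target bucket
    val unzipped = new GZIPInputStream(s3.getObject(srcBucket, key).getObjectContent)
    s3.putObject(destBucket, key.stripSuffix(".gz"), unzipped, new ObjectMetadata())
  }
}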

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
endpoint of this notification. This would then convey to a listening Spark Streaming job the file information to download. I like this! Cheers, Ben > On Apr 9, 2016, at 9:54 AM, Benjamin Kim wrote: > > This is awesome! I have someplace to start from. > > Thanks, > Ben

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
> Sent from my iPhone > > On Apr 9, 2016, at 9:55 AM, Benjamin Kim wrote: > >> Nezih, >> >> This looks like a good alternative to having the Spark Streaming job check >> for new files on its own. Do you know if there is a wa

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
sKey", > AWSSecretAccessKey) > > val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder") > > This code will probe for new S3 files created in your every batch interval. > > Thanks, > Natu > > On Fri, Apr 8, 2016 at 9:14 PM, Benja

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", > AccessKeyId) > ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", > AWSSecretAccessKey) > > val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder") > > This code will probe for new S3 files created i
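
Pieced together, the approach suggested in this thread looks roughly like the sketch below; the batch interval, bucket path, and credential sources are placeholders, not working settings:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("S3DirectoryMonitor")
val ssc = new StreamingContext(conf, Seconds(30)) // illustrative batch interval

// Placeholder credential sources
val AccessKeyId = sys.env("AWS_ACCESS_KEY_ID")
val AWSSecretAccessKey = sys.env("AWS_SECRET_ACCESS_KEY")

// Note: the quoted snippet sets s3n credentials but reads an s3:// path;
// the URI scheme and the credential keys should match.
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", AWSSecretAccessKey)

// textFileStream probes the directory each batch interval and picks up
// only files that appeared since the previous batch
val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")
inputS3Stream.foreachRDD(rdd => println(s"new lines this batch: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()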

Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Benjamin Kim
Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled any new files to process? If so, can you provide basic Scala coding help on this? Thanks, Ben

can spark-csv package accept strings instead of files?

2016-04-01 Thread Benjamin Kim
Does anyone know if this is possible? I have an RDD loaded with rows of CSV data strings. Each string represents the header row and multiple rows of data along with delimiters. I would like to feed each through a CSV parser to convert the data into a dataframe and, ultimately, UPSERT a Hive/HBase
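
For what it is worth, the spark-csv package exposes a CsvParser that can consume an RDD[String] directly, which seems to fit this case. A sketch, assuming spark-csv 1.x, an existing sqlContext, and an rddOfCsvFiles where each element holds one whole CSV file:

import com.databricks.spark.csv.CsvParser

// Split each whole-file string into individual CSV lines
val lines = rddOfCsvFiles.flatMap(_.split("\n"))

// csvRdd parses an RDD of CSV lines into a DataFrame; if every file
// repeats the header line, the duplicates would need filtering first
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, lines)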

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
'),'yyyy-MM-dd')) > AS TransactionDate > , TransactionType > , Description > , Value > , Balance > , AccountName > , AccountNumber > FROM tmp > """ > sql(sqltext) > > println ("\nFinished at"

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com

Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
I have a quick question. I have downloaded multiple zipped files from S3 and unzipped each one of them into strings. The next step is to parse using a CSV parser. I want to know if there is a way to easily use the spark-csv package for this. Thanks, Ben

BinaryFiles to ZipInputStream

2016-03-23 Thread Benjamin Kim
I need a little help. I am loading zipped CSV files stored in S3 into Spark 1.6. First of all, I am able to get the list of file keys that have a modified date within a range of time by using the AWS SDK objects (AmazonS3Client, ObjectListing, S3ObjectSummary, ListObjectsRequest, GetObjectReques
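
A sketch of that listing step with the classes named above, assuming the AWS Java SDK and placeholder bucket/prefix values:

import java.util.Date
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}

val s3 = new AmazonS3Client() // credentials resolved from the environment

// Collect keys whose last-modified timestamp falls inside [from, to)
def keysModifiedBetween(bucket: String, prefix: String, from: Date, to: Date): Seq[String] = {
  val request = new ListObjectsRequest().withBucketName(bucket).withPrefix(prefix)
  var listing: ObjectListing = s3.listObjects(request)
  val keys = scala.collection.mutable.ListBuffer[String]()
  var more = true
  while (more) {
    keys ++= listing.getObjectSummaries.asScala
      .filter(s => !s.getLastModified.before(from) && s.getLastModified.before(to))
      .map(_.getKey)
    // Listings are paginated; keep fetching while results are truncated
    if (listing.isTruncated) listing = s3.listNextBatchOfObjects(listing)
    else more = false
  }
  keys.toList
}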

Re: new object store driver for Spark

2016-03-22 Thread Benjamin Kim
Hi Gil, Currently, our company uses S3 heavily for data storage. Can you further explain the benefits of this in relation to S3 when the pending patch does come out? Also, I have heard of Swift from others. Can you explain to me the pros and cons of Swift compared to HDFS? It can be just a brie

Re: S3 Zip File Loading Advice

2016-03-15 Thread Benjamin Kim
Could you wrap the ZipInputStream in a List, since a subtype of > TraversableOnce[?] is required? > > case (name, content) => List(new ZipInputStream(content.open)) > > Xinh > > On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim wrote: >
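
Putting that suggestion together with sc.binaryFiles, a sketch of reading every entry of every zip into lines (the path is illustrative, and sc is an existing SparkContext):

import java.util.zip.ZipInputStream
import scala.io.Source

val lines = sc.binaryFiles("s3n://example_bucket/path/*.zip").flatMap {
  case (name, content) =>
    val zis = new ZipInputStream(content.open())
    // Walk the archive's entries; toList materializes each entry's lines
    // before getNextEntry advances the underlying stream
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap(_ => Source.fromInputStream(zis).getLines().toList)
}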

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Benjamin Kim
compressionByName() resides in a class with @InterfaceAudience.Private which > got moved in the master branch. > > So it looks like there is some work to be done for backporting to branch-1 :-) > > On Sun, Mar 13, 2016 at 1:35 PM, Benjamin Kim wrote: >

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Benjamin Kim
Cheers > > On Sun, Mar 13, 2016 at 11:39 AM, Benjamin Kim wrote: > Hi Ted, > > I see that you’re working on the hbase-spark module for hbase. I recently > packaged the SparkOnHBase project and gave it a test run. It works like a > charm on CDH

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Benjamin Kim
to root pom.xml: > hbase-spark > > Then you would be able to build the module yourself. > > hbase-spark module uses APIs which are compatible with hbase 1.0 > > Cheers > > On Sun, Mar 13, 2016 at 11:39 AM, Benjamin Kim wrote: >

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Benjamin Kim
Hi Ted, I see that you’re working on the hbase-spark module for hbase. I recently packaged the SparkOnHBase project and gave it a test run. It works like a charm on CDH 5.4 and 5.5. All I had to do was add /opt/cloudera/parcels/CDH/jars/htrace-core-3.1.0-incubating.jar to the classpath.txt fil
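
For anyone trying the same thing, a rough sketch of how SparkOnHBase is typically driven once it is on the classpath; the table, column family, and data are made up, and bulkPut's exact signature may differ between the Cloudera Labs project and the newer hbase-spark module:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import com.cloudera.spark.hbase.HBaseContext

val hbaseConf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, hbaseConf)

// Illustrative (rowkey, value) pairs
val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

// Each element becomes a Put against table "test_table",
// column family "cf", qualifier "col"; autoFlush = true
hbaseContext.bulkPut[(String, String)](rdd, "test_table", {
  case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    put
}, true)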

Re: S3 Zip File Loading Advice

2016-03-09 Thread Benjamin Kim
h zip? Single file archives are processed just > like text as long as it is one of the supported compression formats. > > Regards > Sab > > On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim wrote: > I am wondering if anyone can help. > >

S3 Zip File Loading Advice

2016-03-08 Thread Benjamin Kim
I am wondering if anyone can help. Our company stores zipped CSV files in S3, which has been a big headache from the start. I was wondering if anyone has created a way to iterate through several subdirectories (s3n://events/2016/03/01/00, s3n://events/2016/03/01/01, etc.) in S3 to find the newest file

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop clsuter

2016-03-07 Thread Benjamin Kim
To comment… At my company, we have not gotten it to work in any other mode than local. If we try any of the yarn modes, it fails with a “file does not exist” error when trying to locate the executable jar. I mentioned this to the Hue users group, which we used for this, and they replied that th

Re: SFTP Compressed CSV into Dataframe

2016-03-03 Thread Benjamin Kim
wrote: > > (-user) > > On Thursday 03 March 2016 10:09 PM, Benjamin Kim wrote: >> I forgot to mention that we will be scheduling this job using Oozie. So, we >> will not be able to know which worker node is going to be running this. >> If we try to do anything local, i

Re: SFTP Compressed CSV into Dataframe

2016-03-03 Thread Benjamin Kim
On Mar 2, 2016, at 11:17 AM, Benjamin Kim wrote: > > I wonder if anyone has opened an SFTP connection to open a remote GZIP CSV > file? I am able to download the file first locally using the SFTP Client in > the spark-sftp package. Then, I load the file into a dataframe using the >

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Benjamin Kim
I want to ask about something related to this. Does anyone know if there is or will be a command-line equivalent of the spark-shell client for Livy Spark Server or any other Spark Job Server? The reason that I am asking is that spark-shell does not handle multiple users on the same server well. Since a Spa

SFTP Compressed CSV into Dataframe

2016-03-02 Thread Benjamin Kim
I wonder if anyone has opened an SFTP connection to open a remote GZIP CSV file? I am able to download the file first locally using the SFTP Client in the spark-sftp package. Then, I load the file into a dataframe using the spark-csv package, which automatically decompresses the file. I just want
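
For comparison, the spark-sftp package also exposes a DataFrame reader that performs the download itself; a sketch assuming its com.springml.spark.sftp data source with placeholder connection details (whether the gzip decompression happens transparently here, as it does when spark-csv reads a .gz file directly, would need testing):

// spark-sftp and spark-csv assumed on the classpath;
// host, credentials, and path are placeholders
val df = sqlContext.read
  .format("com.springml.spark.sftp")
  .option("host", "sftp.example.com")
  .option("username", "user")
  .option("password", "secret")
  .option("fileType", "csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/remote/path/data.csv.gz")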

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Benjamin Kim
Ted, Any idea as to when this will be released? Thanks, Ben > On Feb 17, 2016, at 2:53 PM, Ted Yu wrote: > > The HBASE JIRA below is for HBase 2.0 > > HBase Spark module would be back ported to hbase 1.3.0 > > FYI > > On Feb 17, 2016, at 1:13 PM, Chandeep Singh

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Benjamin Kim
I got the same problem when I added the Phoenix plugin jar in the driver and executor extra classpaths. Do you have those set too? > On Feb 9, 2016, at 1:12 PM, Koert Kuipers wrote: > > yes its not using derby i think: i can see the tables in my actual hive > metastore. > > i was using a syml

Re: Is there a any plan to develop SPARK with c++??

2016-02-03 Thread Benjamin Kim
Hi DaeJin, The closest thing I can think of is this. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html Cheers, Ben > On Feb 3, 2016, at 9:49 PM, DaeJin Jung wrote: > > hello everyone, > I have a short question. > > I would like to improve perfor

Re: Spark with SAS

2016-02-03 Thread Benjamin Kim
You can download the Spark ODBC Driver. https://databricks.com/spark/odbc-driver-download > On Feb 3, 2016, at 10:09 AM, Jörn Franke wrote: > > This could be done through odbc. Keep in mind that you can run SaS jobs > directly on a Hadoop cluster using the SaS embedded process engine or dump

Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread Benjamin Kim
Hi David, My company uses Lambda to do simple data moving and processing using Python scripts. I can see that using Spark instead for the data processing would make it into a real production-level platform. Does this pave the way to replacing the need for a pre-instantiated cluster in AWS or bought

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-26 Thread Benjamin Kim
> but I couldn't get this to work for whatever reason, so I'm sticking to the > --jars approach used in my examples. > > On Tue, Dec 22, 2015 at 9:51 PM, Benjamin Kim wrote: > Stephen, > > Let me confirm. I just need to propagat

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-25 Thread Benjamin Kim
the spark.worker.cleanup.appDataTtl config param. > > The Spark SQL programming guide says to use SPARK_CLASSPATH for this purpose, > but I couldn't get this to work for whatever reason, so I'm sticking to the > --jars approach used in my examples. > > On Tue, Dec 22,

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Benjamin Kim
18:35 GMT-08:00 Benjamin Kim: > >> Hi Stephen, >> >> I forgot to mention that I added these lines below to the >> spark-defaults.conf on the node with the Spark SQL Thrift JDBC/ODBC Server >> running on it. Then, I restarted it. >> >> spark.drive

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Benjamin Kim
Hi Stephen, I forgot to mention that I added these lines below to the spark-defaults.conf on the node with the Spark SQL Thrift JDBC/ODBC Server running on it. Then, I restarted it. spark.driver.extraClassPath=/usr/share/java/postgresql-9.3-1104.jdbc41.jar spark.executor.extraClassPath=/usr/share/ja
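
Once the driver jar is visible on both classpaths, a typical Spark SQL read against PostgreSQL looks like the following (Spark 1.5 syntax; the URL, table, and credentials are placeholders):

// Assumes the PostgreSQL JDBC jar is on the driver and executor classpaths
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("dbtable", "public.example_table")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .option("driver", "org.postgresql.Driver")
  .load()
df.show()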
