Re: [Arrow][Dremio]

2018-05-14 Thread Pierce Lamb
Hi Xavier, Along the lines of connecting to multiple sources of data and replacing ETL tools, you may want to check out Confluent's blog on building a real-time streaming ETL pipeline on Kafka as well as

Re: Streaming Analytics/BI tool to connect Spark SQL

2017-12-07 Thread Pierce Lamb
Hi Umar, While this answer is a bit dated, you may find it useful in choosing a store for Spark SQL tables: https://stackoverflow.com/a/39753976/3723346 I don't know much about Pentaho or Arcadia, but I assume many of the listed options have a JDBC or ODBC client. Hope this helps, Pierce

Re: Update MySQL table via Spark/SparkR?

2017-08-22 Thread Pierce Lamb
Hi Jake, There is another option among the third-party projects in the Spark database ecosystem that have combined Spark with a DBMS in such a way that the DataFrame API has been extended to include UPDATE operations

Re: using Kudu with Spark

2017-07-24 Thread Pierce Lamb
Hi Mich, I tried to compile a list of datastores that connect to Spark and provide a bit of context. The list may help you in your research: https://stackoverflow.com/a/39753976/3723346 I'm going to add Kudu, Druid and Ampool from this thread. I'd like to point out SnappyData

Re: "Sharing" dataframes...

2017-06-21 Thread Pierce Lamb
Hi Jean, Since many in this thread have mentioned datastores from what I would call the "Spark datastore ecosystem", I thought I would link you to a StackOverflow answer I posted a while back that tried to capture the majority of this ecosystem. Most would claim to allow you to do something like

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Pierce Lamb
Hi Nipun, To expand a bit, you might find this StackOverflow answer useful: http://stackoverflow.com/a/39753976/3723346 Most Spark + database combinations can handle a use case like this. Hope this helps, Pierce On Thu, May 4, 2017 at 9:18 AM, Gene Pang wrote: > As Tim

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-11 Thread Pierce Lamb
Hi, It is possible to use Mongo or Cassandra to persist results from Spark. In fact, a wide variety of data stores are available to use with Spark, and many are aimed at serving queries for dashboard visualizations. I cannot comment on which work well with Grafana or Kibana; however, I've listed

Re: Apache Drill vs Spark SQL

2017-04-07 Thread Pierce Lamb
Hi Kant, If you are interested in using Spark alongside a database to serve real-time queries, there are many options. Almost every popular database has built some sort of connector to Spark. I've listed a majority of them and tried to delineate them in some way in this StackOverflow answer:

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Pierce Lamb
SnappyData should work well for what you want. It deeply integrates an in-memory database with Spark and supports ingesting streaming data while concurrently querying it from a dashboard. SnappyData currently has an integration with Apache Zeppelin (notebook visualization) and soon it will have

Re: Appropriate Apache Users List Uses

2016-02-09 Thread Pierce Lamb
..@omernik.com> wrote: > >> All, I received this today, is this appropriate list use? Note: This was >> unsolicited. >> >> Thanks >> John >> >> >> >> From: Pierce Lamb <pl...@snappydata.io> >> 11:57 AM (1 hour ago) >> to me >>

MLlib/kmeans newbie question(s)

2015-03-07 Thread Pierce Lamb
Hi all, I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html Specifically this code:

Re: Help with updateStateByKey

2014-12-18 Thread Pierce Lamb
map on None returns None. Instead, try: Some(currentValue.getOrElse(Seq.empty) ++ newValues) I think that should give you the expected result. From: Pierce Lamb richard.pierce.l...@gmail.com Date: Thursday, December 18, 2014 at 2:31 PM To: Silvio Fiorito silvio.fior...@granturing.com Cc
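To illustrate the suggestion in the reply above, here is a minimal sketch in plain Scala (no Spark dependency) that contrasts the `map`-based update with the `getOrElse`-based one. The names `UpdateStateSketch`, `updateWithMap`, and `updateWithGetOrElse`, and the choice of `String` as the element type, are assumptions made for illustration; the original code from the thread is not shown in this archive. The function shape `(Seq[V], Option[S]) => Option[S]` is the one `updateStateByKey` expects.

```scala
object UpdateStateSketch {
  // Hypothetical version of the problem: map on None returns None,
  // so a key with no prior state never gets any state at all.
  def updateWithMap(newValues: Seq[String],
                    currentValue: Option[Seq[String]]): Option[Seq[String]] =
    currentValue.map(_ ++ newValues)

  // The expression suggested in the reply: fall back to an empty Seq
  // when there is no prior state, then append the new values.
  def updateWithGetOrElse(newValues: Seq[String],
                          currentValue: Option[Seq[String]]): Option[Seq[String]] =
    Some(currentValue.getOrElse(Seq.empty) ++ newValues)

  def main(args: Array[String]): Unit = {
    // First batch for a key: there is no prior state yet.
    println(updateWithMap(Seq("a"), None))
    println(updateWithGetOrElse(Seq("a"), None))
    // Later batch: prior state exists and the new values are appended.
    println(updateWithGetOrElse(Seq("b"), Some(Seq("a"))))
  }
}
```

In a streaming job, a function of this shape would be passed to `updateStateByKey` on a keyed DStream; that wiring is omitted here because it depends on the application.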

Help with updateStateByKey

2014-12-17 Thread Pierce Lamb
I am trying to run stateful Spark Streaming computations over (fake) Apache web server logs read from Kafka. The goal is to sessionize the web traffic similar to this blog post: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/