Re: Time series data

2018-05-24 Thread Vadim Semenov
(Approx. 360 GB per day) – Therefore, we will end up with 10's or 100's of TBs of data and I feel that NoSQL will be much quicker than Hadoop/Spark. This is time series data that are coming from many devices in form of flat files and it is currently extracted / transformed / loaded

Re: Time series data

2018-05-24 Thread Jörn Franke
> using spark with any nosql or TSDB? We receive 1 mil meters x 288 readings = 288 mil rows (Approx. 360 GB per day) – Therefore, we will end up with 10's or 100's of TBs of data and I feel that NoSQL will be much quicker than Hadoop/Spark. This is time series data that are coming from many devices in form of flat files

Time series data

2018-05-24 Thread amin mohebbi
than Hadoop/Spark. This is time series data that are coming from many devices in form of flat files and it is currently extracted / transformed / loaded into another database which is connected to BI tools. We might use Azure Data Factory to collect the flat files and then use Spark to do the ETL
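
A minimal sketch (not from the thread) of the Spark ETL step described above, assuming CSV flat files with hypothetical columns meter_id, reading_time and value; the Azure storage paths are placeholders only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_date

    val spark = SparkSession.builder().appName("meter-etl").getOrCreate()
    import spark.implicits._

    // Flat files landed by Azure Data Factory (path and schema are assumptions)
    val readings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://landing@storageacct.dfs.core.windows.net/meters/*.csv")

    // ~288 million rows/day stay manageable when written partitioned by day
    readings
      .withColumn("reading_date", to_date($"reading_time"))
      .write.mode("append")
      .partitionBy("reading_date")
      .parquet("abfss://curated@storageacct.dfs.core.windows.net/meters/")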

RE: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Prateek .
Hi, the spark connector docs say (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md): "The number of Spark partitions (tasks) created is directly controlled by the setting spark.cassandra.input.split.size_in_mb."

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Matthias Niehoff
Hi, the spark connector docs say (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md): "The number of Spark partitions (tasks) created is directly controlled by the setting spark.cassandra.input.split.size_in_mb. This number reflects the approximate amount of Cassandra data in each Spark partition."
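
For reference, a small sketch (assumptions: the DataStax spark-cassandra-connector is on the classpath and Cassandra runs locally; keyspace and table names are taken from the thread) showing where that setting goes and how to check the resulting number of partitions/tasks:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-read")
      .set("spark.cassandra.connection.host", "127.0.0.1")    // assumption: local Cassandra
      .set("spark.cassandra.input.split.size_in_mb", "64")    // smaller value => more, smaller partitions (tasks)

    val sc = new SparkContext(conf)
    val rows = sc.cassandraTable("iotdata", "coordinate")     // keyspace and table from the thread
    println(rows.partitions.length)                           // number of tasks the read will produce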

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Bryan Jeffrey
Prateek, I believe that one task is created per Cassandra partition. How is your data partitioned? Regards, Bryan Jeffrey. On Thu, Mar 10, 2016 at 10:36 AM, Prateek wrote: > Hi, I have a Spark Batch job for reading timeseries data from Cassandra which has 50,000 rows.

Spark job for Reading time series data from Cassandra

2016-03-10 Thread Prateek .
Hi, I have a Spark Batch job for reading timeseries data from Cassandra which has 50,000 rows. JavaRDD cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate") .map(new Function() { @Override public

Re: Content based window operation on Time-series data

2015-12-17 Thread Sandy Ryza
wrote: > Hi all, We have RDD(main) of sorted time-series data. We want to split it into different RDDs according to window size and then perform some aggregation operation like max, min etc. over each RDD in parallel.

Re: Content based window operation on Time-series data

2015-12-17 Thread Davies Liu
Could you try this? df.groupBy(cast((col("timeStamp") - start) / bucketLengthSec, IntegerType)).agg(max("timestamp"), max("value")).collect() On Wed, Dec 9, 2015 at 8:54 AM, Arun Verma <arun.verma...@gmail.com> wrote: > Hi all, We have RDD(main) of sorted time-series data.
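
A runnable Scala version of that bucketing idea, as a sketch only (the column names, sample data and bucket length are assumptions): rows are assigned an integer bucket index relative to the series start, then aggregated per bucket.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("time-buckets").getOrCreate()
    import spark.implicits._

    val bucketLengthSec = 3600L                                 // window size w, e.g. one hour
    val df = Seq((1000L, 1.0), (1200L, 3.0), (5000L, 2.0))      // (timeStamp, value) sample rows
      .toDF("timeStamp", "value")

    val start = df.agg(min($"timeStamp")).as[Long].first()      // startTime of the series

    df.withColumn("bucket", (($"timeStamp" - start) / bucketLengthSec).cast("long"))
      .groupBy($"bucket")
      .agg(max($"timeStamp").as("maxTimeStamp"),
           max($"value").as("maxValue"),
           min($"value").as("minValue"))
      .show()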

Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
Hi all, *We have RDD(main) of sorted time-series data. We want to split it into different RDDs according to window size and then perform some aggregation operation like max, min etc. over each RDD in parallel.* If window size is w then the i-th RDD has data from (startTime + (i-1)*w) to (startTime + i*w).
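
A small sketch of the split itself (assumptions: the RDD holds (timestampSec, value) pairs and startTime, w and the window count are known up front); each returned RDD covers one window and can be aggregated independently:

    import org.apache.spark.rdd.RDD

    def splitByWindow(main: RDD[(Long, Double)], startTime: Long, w: Long, numWindows: Int)
        : Seq[RDD[(Long, Double)]] =
      (1 to numWindows).map { i =>
        val lo = startTime + (i - 1) * w                  // i-th window covers [lo, hi)
        val hi = startTime + i * w
        main.filter { case (ts, _) => ts >= lo && ts < hi }
      }

    // e.g. max value per window (one Spark job per window; fine for a modest window count)
    // splitByWindow(main, startTime, w, n).map(rdd => rdd.values.max)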

Re: Content based window operation on Time-series data

2015-12-09 Thread Sean Owen
CC Sandy as his https://github.com/cloudera/spark-timeseries might be of use here. On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma <arun.verma...@gmail.com> wrote: > Hi all, We have RDD(main) of sorted time-series data. We want to split it into different RDDs according to window size.

Re: Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma <arun.verma...@gmail.com> wrote: > Hi all, We have RDD(main) of sorted time-series data. We want to split it into different RDDs according to window size and then perform some aggregation operation like max, min etc. over each RDD in parallel.

Re: Time series data

2015-06-29 Thread tog
with multiple time series data and in summary I have to adjust each time series (like inserting average values in data gaps) and then train regression models with MLlib for each time series. The adjustment step I did with the adjustment function being mapped over each element of the RDD

Time series data

2015-06-26 Thread Caio Cesar Trucolo
Hi everyone! I am working with multiple time series data and in summary I have to adjust each time series (like inserting average values in data gaps) and then train regression models with MLlib for each time series. The adjustment step I did with the adjustment function being mapped over each element of the RDD
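
A rough sketch of those two stages (assumed schema: (seriesId, ts, value) with a fixed step between readings; names and data are illustrative only): gaps in each series are filled with the series mean, then one linear regression is fit per series with spark.ml.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression

    val spark = SparkSession.builder().appName("per-series-adjust").getOrCreate()
    import spark.implicits._

    val raw = spark.sparkContext.parallelize(
      Seq(("a", 0L, 1.0), ("a", 2L, 3.0), ("b", 0L, 5.0), ("b", 1L, 4.0)))
    val step = 1L

    // Stage 1: fill missing timestamps in each series with the series mean
    val filled = raw
      .groupBy(_._1)
      .flatMap { case (id, rows) =>
        val byTs = rows.map(r => r._2 -> r._3).toMap
        val mean = byTs.values.sum / byTs.size
        (byTs.keys.min to byTs.keys.max by step).map(t => (id, t, byTs.getOrElse(t, mean)))
      }
      .toDF("seriesId", "ts", "value")

    // Stage 2: one regression model per series (collect() of ids is OK only for a
    // small number of series; a grouped/partitioned approach is needed at scale)
    val assembler = new VectorAssembler().setInputCols(Array("ts")).setOutputCol("features")
    val models = filled.select("seriesId").distinct().as[String].collect().map { id =>
      val train = assembler.transform(
        filled.filter($"seriesId" === id).withColumn("label", $"value"))
      id -> new LinearRegression().fit(train)
    }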

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Nisrina Luthfiyati
(by time dimension). I'm wondering what the best practices are for grouping/storing a time series data stream using Spark Streaming. I'm considering grouping each batch of data in Spark Streaming per time dimension and then saving each group to different HDFS directories. However, since it is possible for data with the same time dimension to be in different batches

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Helena Edelson
within the same time dimension to the same HDFS directory. The data stream might be unordered (by time dimension). I'm wondering what the best practices are for grouping/storing a time series data stream using Spark Streaming. I'm considering grouping each batch of data in Spark Streaming

Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread Nisrina Luthfiyati
what the best practices are for grouping/storing a time series data stream using Spark Streaming. I'm considering grouping each batch of data in Spark Streaming per time dimension and then saving each group to different HDFS directories. However, since it is possible for data with the same time dimension to be in different batches
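
As one possible shape for this (a sketch only; the socket source, column names and HDFS path are placeholders), each micro-batch can be bucketed by its event-time day and appended under a per-day directory, so records for a day that arrive in a later batch still land in that day's directory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_unixtime

    val ssc = new StreamingContext(new SparkConf().setAppName("time-bucketed-sink"), Seconds(30))
    val events = ssc.socketTextStream("localhost", 9999)                        // placeholder source
      .map { line => val Array(ts, payload) = line.split(","); (ts.toLong, payload) }

    events.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder().getOrCreate()
        import spark.implicits._
        rdd.toDF("eventTime", "payload")
          .withColumn("day", from_unixtime($"eventTime" / 1000, "yyyy-MM-dd")) // event-time day, not arrival time
          .write.mode("append")
          .partitionBy("day")                                                   // one HDFS directory per day
          .parquet("hdfs:///data/timeseries")                                   // hypothetical base path
      }
    }
    ssc.start()
    ssc.awaitTermination()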

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
in grouping/storing a time series data stream using Spark Streaming. I'm considering grouping each batch of data in Spark Streaming per time dimension and then saving each group to different HDFS directories. However, since it is possible for data with the same time dimension to be in different batches

Re: How to preserve/preset partition information when load time series data?

2015-03-16 Thread Imran Rashid
changes for HadoopPartition and the compute() method (or, if you can't subclass HadoopRDD directly, you can use it for inspiration). On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng <szheng.c...@gmail.com> wrote: Hi All, If I have a set of time series data files, they are in Parquet format

Re: How to preserve/preset partition information when load time series data?

2015-03-11 Thread Imran Rashid
On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng <szheng.c...@gmail.com> wrote: Hi All, If I have a set of time series data files, they are in Parquet format and the data for each day are stored following a naming convention, but I will not know how many files there are for one day. 20150101a.parq 20150101b.parq

How to preserve/preset partition information when load time series data?

2015-03-09 Thread Shuai Zheng
Hi All, If I have a set of time series data files, they are in Parquet format and the data for each day are stored following a naming convention, but I will not know how many files there are for one day: 20150101a.parq 20150101b.parq 20150102a.parq 20150102b.parq 20150102c.parq ... 201501010a.parq
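
One way to approach this (a sketch, not the HadoopRDD subclassing discussed in the replies; the base path and day list are assumptions, and the day list would normally come from listing the directory): read each day's files with a glob, tag the rows with their day, and repartition on that column so each day ends up in its own partition:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, lit}

    val spark = SparkSession.builder().appName("per-day-parquet").getOrCreate()

    val days = Seq("20150101", "20150102")                   // normally derived by listing HDFS
    val all = days.map { d =>
        spark.read.parquet(s"hdfs:///data/$d*.parq")         // matches 20150101a.parq, 20150101b.parq, ...
          .withColumn("day", lit(d))
      }
      .reduce(_ union _)
      .repartition(days.length, col("day"))                  // one partition per day downstream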

Re: Dealing with Time Series Data

2014-09-17 Thread qihong
what are you trying to do? Generate time series from your data in HDFS, or do some transformation and/or aggregation on your time series data in HDFS?

Dealing with Time Series Data

2014-09-15 Thread Gary Malouf
I have a use case for our data in HDFS that involves sorting chunks of data into time series format by a specific characteristic and doing computations from that. At large scale, what is the most efficient way to do this? Obviously, having the data sharded by that characteristic would make the
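
One common pattern for this kind of shaping (not from the thread; the record layout and key type are assumptions): repartition by the characteristic and sort within partitions on a composite (characteristic, time) key, so each partition holds whole, time-ordered series.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("per-key-time-sort"))

    // assumed record shape: (characteristic, timestampMillis, value)
    val records = sc.parallelize(Seq(("a", 2L, 1.0), ("a", 1L, 2.0), ("b", 5L, 3.0)))

    // partition only on the characteristic so one series never spans partitions,
    // but sort on (characteristic, time) so each series comes out time-ordered
    val byCharacteristic = new HashPartitioner(sc.defaultParallelism) {
      override def getPartition(key: Any): Int =
        super.getPartition(key.asInstanceOf[(String, Long)]._1)
    }

    val sorted = records
      .map { case (k, ts, v) => ((k, ts), v) }
      .repartitionAndSortWithinPartitions(byCharacteristic)

    // downstream computations can stream each series in time order via mapPartitions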