[...] using Spark with any NoSQL or TSDB? We receive 1 million meters x 288
readings = 288 million rows (approx. 360 GB per day); therefore, we will end
up with tens or hundreds of TBs of data, and I feel that NoSQL will be much
quicker than Hadoop/Spark. This is time series data coming from many devices
in the form of flat files, and it is currently extracted, transformed, and
loaded into another database which is connected to BI tools. We might use
Azure Data Factory to collect the flat files and then use Spark to do the ETL.
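A minimal sketch of what the Spark leg of that pipeline could look like,
assuming Spark 1.6 with the spark-csv package and the DataStax Cassandra
connector on the classpath; every path, host, keyspace, and table name below
is invented for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("MeterReadingsETL")
  .set("spark.cassandra.connection.host", "10.0.0.1") // assumption
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read the day's flat files as CSV (wasb:// being the Azure blob scheme).
val readings = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("wasb:///landing/meter-readings/*.csv")

// Land the transformed rows in Cassandra for the BI-facing queries.
readings.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metering", "table" -> "readings"))
  .save()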
Subject: Re: Spark job for Reading time series data from Cassandra
Hi,

The Spark connector docs
(https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md)
say:

"The number of Spark partitions (tasks) created is directly controlled by the
setting spark.cassandra.input.split.size_in_mb. This number reflects the
approximate amount of Cassandra data in each Spark partition."
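A small sketch of where that knob lives, with an illustrative (not
recommended) value; the connection host is an assumption:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.1")
  .set("spark.cassandra.input.split.size_in_mb", "64") // ~64 MB of Cassandra data per Spark partition

val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("iotdata", "coordinate")
println(rdd.partitions.length) // partition count follows from the split size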
Prateek,
I believe that one task is created per Cassandra partition. How is your
data partitioned?
Regards,
Bryan Jeffrey
On Thu, Mar 10, 2016 at 10:36 AM, Prateek . wrote:

Hi,

I have a Spark batch job for reading time series data from Cassandra, which
has 50,000 rows.
// Plausible completion of the truncated snippet; types and the map body are reconstructed.
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import org.apache.spark.api.java.function.Function;

JavaRDD<String> cassandraRowsRDD = javaFunctions(sc)
    .cassandraTable("iotdata", "coordinate")
    .map(new Function<CassandraRow, String>() {
        @Override
        public String call(CassandraRow row) { return row.toString(); }
    });
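With the reconstruction above, calling cassandraRowsRDD.count() would force
the Cassandra scan to actually run; at 50,000 rows this should be a small job.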
Could you try this?
df.groupBy(((col("timeStamp") - start) / bucketLengthSec).cast(IntegerType))
  .agg(max("timestamp"), max("value")).collect()
On Wed, Dec 9, 2015 at 8:54 AM, Arun Verma <arun.verma...@gmail.com> wrote:
Hi all,

We have an RDD (main) of sorted time-series data. We want to split it into
different RDDs according to window size and then perform some aggregation
operation, like max, min, etc., over each RDD in parallel.

If the window size is w, then the ith RDD has data from (startTime + (i-1)*w)
to (startTime + i*w).
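For concreteness, a hedged sketch of the bucketing suggested above, assuming
df has numeric timeStamp and value columns; the w and start literals are
invented:

import org.apache.spark.sql.functions.{col, max}
import org.apache.spark.sql.types.IntegerType

val w = 60L               // window size in seconds (illustrative)
val start = 1449619200L   // startTime as epoch seconds (illustrative)

val perWindow = df
  .withColumn("window", ((col("timeStamp") - start) / w).cast(IntegerType))
  .groupBy("window")
  .agg(max("timeStamp"), max("value"))

perWindow.collect()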
CC Sandy as his https://github.com/cloudera/spark-timeseries might be
of use here.
Hi everyone!

I am working with multiple time series, and in summary I have to adjust each
time series (for example, inserting average values into data gaps) and then
train regression models with MLlib for each series. I did the adjustment step
by mapping the adjustment function over each element of the RDD (in this
case, one time series per element).
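A minimal sketch of that adjustment step as described, assuming one series
per RDD element represented as Array[Double] with NaN marking a gap;
everything here is illustrative:

import org.apache.spark.rdd.RDD

type Series = Array[Double] // one device's time series; NaN marks a gap

def fillGaps(s: Series): Series = {
  val observed = s.filter(v => !v.isNaN)
  val avg = if (observed.nonEmpty) observed.sum / observed.length else 0.0
  s.map(v => if (v.isNaN) avg else v)
}

// Adjust every series in parallel; per-series training can be mapped the same way.
def adjust(allSeries: RDD[Series]): RDD[Series] = allSeries.map(fillGaps)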
I'm wondering what the best practices are for grouping/storing a time series
data stream using Spark Streaming. I'm considering grouping each batch of
data in Spark Streaming per time dimension and then saving each group to a
different HDFS directory. However, since it is possible for data with the
same time dimension to arrive in different batches, I need to write data
within the same time dimension to the same HDFS directory. The data stream
might be unordered (by time dimension).
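Under those constraints, one possible shape, sketched with the DStream API;
the hour-string key, paths, and record type are illustrative assumptions:

import org.apache.spark.streaming.dstream.DStream

def saveByTimeDimension(stream: DStream[(String, String)]): Unit = {
  // key = time dimension (e.g. "2015-08-01-13"), value = serialized record
  stream.foreachRDD { (rdd, batchTime) =>
    val buckets = rdd.keys.distinct().collect()
    for (bucket <- buckets) {
      // Per-batch subdirectory, so late data for a bucket still lands
      // under that bucket's directory without path collisions.
      rdd.filter { case (k, _) => k == bucket }
         .values
         .saveAsTextFile(s"hdfs:///timeseries/$bucket/${batchTime.milliseconds}")
    }
  }
}

This does one filter pass per distinct time dimension in a batch, which stays
cheap as long as each batch only spans a few buckets.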
[...] with changes to HadoopPartition and the compute() method. (Or, if you
can't subclass HadoopRDD directly, you can use it for inspiration.)
On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng szheng.c...@gmail.com
wrote:
Hi All,

If I have a set of time series data files in Parquet format, where the data
for each day are stored following a naming convention, but I will not know
how many files there are for one day:

20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
...
201501010a.parq
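A hedged sketch for that layout, assuming Spark 1.4+ and an invented base
path: since the file count per day is unknown, a per-day glob picks up
however many files exist, and a character class avoids over-matching names
like 201501010a.parq when loading 20150101:

import org.apache.spark.sql.SQLContext

def loadDay(sqlContext: SQLContext, day: String) =
  sqlContext.read.parquet(s"hdfs:///data/ts/$day[a-z].parq") // 20150101a.parq, 20150101b.parq, ...

val jan01 = loadDay(sqlContext, "20150101")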
What are you trying to do? Generate time series from your data in HDFS, or do
some transformation and/or aggregation on your time series data in HDFS?
I have a use case for our data in HDFS that involves sorting chunks of data
into time series format by a specific characteristic and doing computations
from that. At large scale, what is the most efficient way to do this?
Obviously, having the data sharded by that characteristic would make the [...]
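One common pattern for this kind of sort at scale, sketched under the
assumption (Spark 1.2+) that records can be keyed by (characteristic,
timestamp): partition on the characteristic alone, then sort within each
partition, so every series comes out contiguous and time-ordered in a single
shuffle. The key/value types are illustrative:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Route by characteristic only, so a whole series lands in one partition.
class CharacteristicPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (characteristic: String, _) =>
      ((characteristic.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// records keyed by (characteristic, timestamp); value = measurement
def sortIntoSeries(records: RDD[((String, Long), Double)],
                   parts: Int): RDD[((String, Long), Double)] =
  records.repartitionAndSortWithinPartitions(new CharacteristicPartitioner(parts))

Within each partition you can then walk a series start to end without any
further sorting.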