That would be the hard way, but if possible I want to clear the cache
without stopping the application, maybe triggered by a message in the
stream.
On 17 April 2017 at 19:41, ayan guha wrote:
> It sounds like you want to stop the stream process, wipe out the checkpoint
> and restart?
>
> On M
Hello Gaurav,
Pre-calculating the results would obviously be a great idea - load the
results into a serving store from which you serve them out to your customers,
as suggested by Jorn - and run it every hour/day, depending on your
requirements.
Zeppelin (as mentioned by Ayan) would not be a good fit here.
@Ayan - I am creating the temp table dynamically based on the dataset name. I
will explore the df.saveAsTable option.
On Mon, Apr 17, 2017 at 9:53 PM, Ryan wrote:
> It shouldn't be a problem then. We've done a similar thing in Scala. I
> don't have much experience with Python threads but maybe the code related
It shouldn't be a problem then. We've done a similar thing in Scala. I
don't have much experience with Python threads, but maybe the code related
to reading/writing the temp table isn't thread safe.
On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil wrote:
> Thanks Ryan,
>
> Each dataset has separate hiv
Dear fellow Spark users,
*Use case:* I have written a small Java client which launches multiple
Spark jobs through *SparkLauncher* and captures the jobs' metrics during the
course of the execution.
*Issue:* Sometimes the client fails saying -
*Caused by:
org.apache.hadoop.ipc.RemoteException(org.ap
The easiest way I found was to take a look at the source. Any receiver that
calls the version of store that requires an iterator is considered reliable. A
definitive list would be nice.
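For illustration, here is a minimal sketch (mine, not taken from any built-in
receiver) of what a receiver that is "reliable" by that definition looks like;
fetchBatch and ackBatch are hypothetical helpers standing in for the real
source client:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: a receiver counts as reliable when it uses the blocking
// store(Iterator) variant and acknowledges the source only after it returns.
class MyReliableReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("receive-loop") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = ()

  private def receive(): Unit = {
    while (!isStopped()) {
      val batch: Seq[String] = fetchBatch()   // hypothetical: pull a batch from the source
      if (batch.nonEmpty) {
        store(batch.iterator)                 // blocks until Spark has stored the block
        ackBatch(batch)                       // hypothetical: ack only after store returns
      }
    }
  }

  private def fetchBatch(): Seq[String] = Seq.empty
  private def ackBatch(batch: Seq[String]): Unit = ()
}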
Kind Regards
- Original Message -
From: "Justin Pihony"
To: "user"
Sent: Monday, 17 April, 2017 20:
I can't seem to find anywhere that would let a user know if the receiver they
are using is reliable or not. Even better would be a list of known reliable
receivers. Are any of these things possible? Or do you just have to research
your receiver beforehand?
I don't see this behavior in the current Spark master:
scala> val df = Seq("m_123", "m_111", "m_145", "m_098", "m_666").toDF("msrid")
df: org.apache.spark.sql.DataFrame = [msrid: string]
scala> df.filter($"msrid".isin("m_123")).count
res0: Long = 1
scala> df.filter($"msrid".isin("m_123","m_111","
Hello All,
Does anyone know if the skew handling code mentioned in this talk
https://www.youtube.com/watch?v=bhYV0JOPd9Y was added to Spark?
If so, can you point me to more info - a JIRA or a pull request?
Thanks in advance.
Regards,
Vishnu Viswanath.
Thanks for responding.
df.filter($"msrid" === "m_123" || $"msrid" === "m_111")
There are lots of workarounds to my question, but can you let me know what is
wrong with the "isin" query?
Regards,
Nayan
> Begin forwarded message:
>
> From: ayan guha
> Subject: Re: isin query
> Date: 17 April 2017 at 8:13:
How about using OR operator in filter?
On Tue, 18 Apr 2017 at 12:35 am, nayan sharma
wrote:
> Dataframe (df) having column msrid(String) having values
> m_123,m_111,m_145,m_098,m_666
>
> I wanted to filter out rows which are having values m_123,m_111,m_145
>
> df.filter($"msrid".isin("m_123","m_
I have a dataframe (df) with a column msrid (String) containing the values
m_123, m_111, m_145, m_098, m_666.
I want to filter the rows whose value is m_123, m_111 or m_145.
df.filter($"msrid".isin("m_123","m_111","m_145")).count
gives count = 0, while
df.filter($"msrid".isin("m_123")).count
gives count = 121212.
I have tried
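For reference, a minimal sketch (not the poster's code) of equivalent ways to
write the filter; if the stored values carry stray whitespace, the trimmed
variant may explain why the multi-value isin returns 0:

import org.apache.spark.sql.functions._

val wanted = Seq("m_123", "m_111", "m_145")

df.filter($"msrid".isin(wanted: _*)).count          // isin with a Seq expanded to varargs
df.filter($"msrid" === "m_123" || $"msrid" === "m_111" || $"msrid" === "m_145").count
df.filter(trim($"msrid").isin(wanted: _*)).count    // in case values have surrounding spaces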
Yes, 5 MB is a difficult size: too small for HDFS, too big for Parquet/ORC.
Maybe you can put the data in a HAR and store (id, path) in ORC/Parquet.
> On 17. Apr 2017, at 10:52, 莫涛 wrote:
>
> Hi Jörn,
>
>
>
> I do think a 5 MB column is odd but I don't have any other idea before asking
> this q
One possibility is using Hive bucketed on the id column.
Another option: build the index in HBase, i.e. store the id and the HDFS path
in HBase. This way your scans will be fast, and once you have the HDFS path
pointers you can read the actual data from HDFS.
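A hedged sketch of that flow, with a Parquet index table standing in for HBase
for brevity (paths, column names and wantedIds are made up for illustration):

val wantedIds = Seq("id_001", "id_002")

// small index table with columns (id, path); the HBase variant would do a get/scan instead
val index = spark.read.parquet("/warehouse/video_index")
val paths = index.filter($"id".isin(wantedIds: _*)).select("path").as[String].collect()

paths.foreach { p =>
  // read the raw bytes of one matched file directly from HDFS
  val bytes = spark.sparkContext.binaryFiles(p).values.first().toArray()
  // ... process bytes ...
}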
On Mon, 17 Apr 2017 at 6:52 pm, 莫涛 wrote:
>
Zeppelin is more useful for interactive data exploration. If the reports
are known beforehand then any good reporting tool should work, such as
Tableau, Qlik, Power BI, etc. Zeppelin is not a fit for this use case.
On Mon, 17 Apr 2017 at 6:57 pm, Gaurav Pandya
wrote:
> Thanks Jorn. Yes, I will prec
It sounds like you want to stop the stream process, wipe out the checkpoint
and restart?
On Mon, 17 Apr 2017 at 10:13 pm, Matthias Niehoff <
matthias.nieh...@codecentric.de> wrote:
> Hi everybody,
>
> is there a way to complete invalidate or remove the state used by
> mapWithState, not only for
What happens if you do not use the temp table, but directly do
df.saveAsTable with mode append? If I have to guess without looking at the
code of your task function, I would think the name of the temp table is
evaluated statically, so all threads are referring to the same table. In other
words your app is n
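A minimal sketch of that direct write (database and table names are
hypothetical, not from the original code):

import org.apache.spark.sql.DataFrame

// append each dataset straight into its own Hive table instead of going
// through a shared, statically named temp table
def ingest(datasetName: String, df: DataFrame): Unit = {
  df.write
    .mode("append")
    .saveAsTable(s"mydb.$datasetName")
}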
Thanks Ryan,
Each dataset has a separate Hive table. All Hive tables belong to the same
Hive database.
The idea is to ingest data in parallel into the respective Hive tables.
If I run the code sequentially for each data source it works fine, but it will
take a lot of time. We are planning to process around 30-40
On Mon, Apr 17, 2017 at 3:25 PM, Zeming Yu wrote:
> I've got a dataframe with a column looking like this:
>
> display(flight.select("duration").show())
>
> +--------+
> |duration|
> +--------+
> |  15h10m|
> |   17h0m|
> |  21h25m|
> |  14h25m|
> |  14h30m|
> +--------+
> only showing top 20 rows
Good to know it worked. If some of the jobs still fail, it can indicate
skew in your dataset. You may want to think of a partition-by function (a
sketch follows below).
Also, do you still see containers killed by YARN? If so, at what point? You
should see something like: your app is trying to use x GB while YARN can
prov
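A hedged illustration of the partition-by / salting idea (the column name and
the number of salts are made up, not from the actual job):

import org.apache.spark.sql.functions._

// spread records of a hot key across many tasks by adding a small random salt
val salted = df.withColumn("salt", (rand() * 16).cast("int"))
val spread = salted.repartition($"key", $"salt")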
I've got a dataframe with a column looking like this:
display(flight.select("duration").show())
+--------+
|duration|
+--------+
|  15h10m|
|   17h0m|
|  21h25m|
|  14h30m|
|  24h50m|
|  26h10m|
|  14h30m|
|   23h5m|
|  21h30m|
|  11h50m|
|  16h10m|
|  15h15m|
|  21h25m|
|  14h25m|
|  14h40m|
|
Hi everybody,
is there a way to completely invalidate or remove the state used by
mapWithState, not only for a given key using State#remove()?
Deleting the state key by key is not an option, as a) not all possible keys
are known (there might be a workaround for that, of course) and b) the number
of keys is too big an
I am playing with some data using (standalone) spark-shell (Spark version
1.6.0) by executing `spark-shell`. The flow is simple, a bit like cp:
basically moving 100k local files (the max size is 190k) to S3. Memory is
configured as below:
export SPARK_DRIVER_MEMORY=8192M
export SPARK_WORKER_C
Thanks Jorn. Yes, I will precalculate the results. Do you think Zeppelin
can work here?
On Mon, Apr 17, 2017 at 1:41 PM, Jörn Franke wrote:
> Processing through Spark is fine, but I do not recommend that each of the
> users triggers a Spark query. So either you precalculate the reports in
> Spar
Hi Jörn,
I do think a 5 MB column is odd, but I didn't have any other idea before
asking this question. The binary data is a short video and the maximum size is
no more than 50 MB.
Hadoop archive sounds very interesting and I'll try it first to check whether
filtering is fast on it.
To my be
How about the event timeline on executors? It seems adding more executors
could help.
1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that
states the predicate pushdown should work. And I think "only for matched ones
the binary data is read" is true if a proper index is configured. The row
group w
Hi Ryan,
The attachment is a screenshot of the Spark job, and this is the only stage
for this job.
I've changed the partition size to 1 GB by "--conf
spark.sql.files.maxPartitionBytes=1073741824".
1. spark-orc seems not that smart. The input size is almost the whole data. I
guess "only for
Processing through Spark is fine, but I do not recommend that each of the users
triggers a Spark query. So either you precalculate the reports in Spark so that
the reports themselves do not trigger Spark queries or you have a database that
serves the report. For the latter case there are tons of
You need to sort the data by id, otherwise a situation can occur where the
index does not work. Aside from this, it sounds odd to put a 5 MB column in
those formats. This will also not be so efficient.
What is in the 5 MB binary data?
You could use HAR or maybe HBase to store this kind of dat
Thanks for the reply, Jorn.
In my case, I am going to put the analysis on an e-commerce website, so
naturally there will be more users and the number will keep growing as the
e-commerce website captures the market. Users will not be doing any analysis
here. Reports will show their purchasing behaviour and pattern (kind of
1. Per my understanding, for ORC files it should push down the filters,
which means all id columns will be scanned but the binary data is read only
for the matched ones. I haven't dug into the spark-orc reader though.
2. ORC itself has a row group index and a bloom filter index. You may try
configurations
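For example, a hedged sketch of what those configurations might look like
(spark.sql.orc.filterPushdown is a real Spark setting; whether the
orc.bloom.filter.columns writer option is honoured depends on the Spark/ORC
version in use, so treat that part as an assumption):

spark.conf.set("spark.sql.orc.filterPushdown", "true")

df.sort("id")                                  // sorting by id keeps the min/max row-group index useful
  .write
  .option("orc.bloom.filter.columns", "id")    // ask ORC to build a bloom filter on id
  .orc("/warehouse/videos_orc")

spark.read.orc("/warehouse/videos_orc")
  .filter($"id".isin(wantedIds: _*))
  .count()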
I think it highly depends on your requirements. There are various tools for
analyzing and visualizing data. How many concurrent users do you have? What
analysis do they do? How much data is involved? Do they have to process the
data all the time or can they live with sampling which increases per
Hi Ryan,
1. "expected qps and response time for the filter request"
I expect that only the requested BINARY are scanned instead of all records, so
the response time would be "10K * 5MB / disk read speed", or several times of
this.
In practice, our cluster has 30 SAS disks and scanning all the