Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Takeshi Yamamuro
> We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. > Dongjoon Hyun -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Takeshi Yamamuro
> ... and early feedback to this release. This release would not have been possible without you. To download Spark 3.1.1, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: https://spark.apache.org/releases/spark-release-3-1-1.html -- --- Takeshi Yamamuro

Re: Spark SQL Dataset and BigDecimal

2021-02-17 Thread Takeshi Yamamuro
> ... any performance penalty for using scala BigDecimal? It's more convenient from an API point of view than java.math.BigDecimal. -- --- Takeshi Yamamuro
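For reference: both decimal types map to Spark's DecimalType, so the cost of scala.math.BigDecimal is mostly the thin wrapper it adds over java.math.BigDecimal. A minimal sketch (class and values are illustrative):

    import org.apache.spark.sql.SparkSession

    case class Price(id: Long, amount: scala.math.BigDecimal)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Spark encodes the Scala wrapper to the same internal Decimal as
    // java.math.BigDecimal, so storage and codegen are identical.
    val ds = Seq(Price(1L, BigDecimal("19.99")), Price(2L, BigDecimal("5.00"))).toDS()
    ds.printSchema() // amount: decimal(38,18) by default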

Re: Custom JdbcConnectionProvider

2020-10-27 Thread Takeshi Yamamuro
... 2020 at 2:31 PM Takeshi Yamamuro wrote: >> Hi, please see an example code in https://github.com/gaborgsomogyi/spark-jdbc-connection-provider (https://github.com/apache/spark/pull/29024). Since it depends on the service loader, I think you ...

Re: Custom JdbcConnectionProvider

2020-10-27 Thread Takeshi Yamamuro
... they are not used. Do I need to register them somehow? Could someone share a relevant example? Thx. -- --- Takeshi Yamamuro
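On the registration question: the provider API (added around Spark 3.1) is discovered through Java's ServiceLoader, so the class must be listed in a service file on the classpath. A hedged sketch; the class name and URL prefix are made up, and the exact abstract members vary by Spark version:

    import java.sql.{Connection, Driver}
    import org.apache.spark.sql.jdbc.JdbcConnectionProvider

    class MyConnectionProvider extends JdbcConnectionProvider {
      override val name: String = "my-provider"
      override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
        options.getOrElse("url", "").startsWith("jdbc:mydb:")
      override def getConnection(driver: Driver, options: Map[String, String]): Connection =
        driver.connect(options("url"), new java.util.Properties())
      override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean =
        false
    }

    // ServiceLoader registration: ship a resource file named
    //   META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider
    // whose content is the single line "com.example.MyConnectionProvider".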

Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Takeshi Yamamuro
... https://spark.apache.org/releases/spark-release-3-0-1.html >> We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. >> Thanks, Ruifeng Zheng -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Takeshi Yamamuro
... release would not have been possible without you. To download Spark 3.0.0, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: https://spark.apache.org/releases/spark-release-3-0-0.html -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Takeshi Yamamuro
>>> Note that you might need to clear your browser cache or use `Private`/`Incognito` mode, depending on your browser. To view the release notes: https://spark.apache.org/releases/spark-release-2.4.6.html We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Takeshi Yamamuro
... your browsers. >> To view the release notes: https://spark.apache.org/releases/spark-release-2.4.5.html We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. >> Dongjoon Hyun -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview2

2019-12-24 Thread Takeshi Yamamuro
... possible without you. To download Spark 3.0.0-preview2, head over to the download page: https://archive.apache.org/dist/spark/spark-3.0.0-preview2 Happy Holidays. Yuming -- --- Takeshi Yamamuro

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Takeshi Yamamuro
... waiting for SPARK-27900. Please let me know if there is another issue. Thanks, Dongjoon. -- --- Takeshi Yamamuro

[ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-17 Thread Takeshi Yamamuro
... http://spark.apache.org/downloads.html To view the release notes: https://spark.apache.org/releases/spark-release-2-3-3.html We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. Best, Takeshi -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-16 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: Spark jdbc postgres numeric array

2019-01-04 Thread Takeshi Yamamuro
Hi, I filed a jira: https://issues.apache.org/jira/browse/SPARK-26540 On Thu, Jan 3, 2019 at 10:04 PM Takeshi Yamamuro wrote: > Hi, I checked that v2.2/v2.3/v2.4/master had the same issue, so can you file a jira? I looked over the related code and I think we ...

Re: Spark jdbc postgres numeric array

2019-01-03 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Takeshi Yamamuro
... Regards, Vaquar Khan, Greater Chicago -- --- Takeshi Yamamuro

Re: How to use disk instead of just InMemoryRelation when using the JDBC datasource in Spark SQL?

2018-04-12 Thread Takeshi Yamamuro
... as I see, all JDBCRelations convert to InMemoryRelation. Because the JDBC table is so big, all the data cannot fit into memory and an OOM occurs. Is there an option to make Spark SQL use disk when memory is not enough? -- --- Takeshi Yamamuro
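One workaround: cache with a storage level that spills to local disk rather than holding everything in memory. A minimal sketch (connection values are placeholders):

    import org.apache.spark.storage.StorageLevel

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "big_table")
      .load()

    // Keeps what fits in executor memory and spills the remainder to disk
    // instead of failing with OOM.
    df.persist(StorageLevel.MEMORY_AND_DISK)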

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Takeshi Yamamuro
... Best, Michael. On Fri, Mar 23, 2018 at 1:51 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: > Hi, what's a query to reproduce this? It seems that when casting double to BigDecimal, it throws the exception. ...

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Takeshi Yamamuro
... at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$2.apply(QueryExecution.scala:204)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$2.apply(QueryExecution.scala:204)
    at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100)
    at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:204)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:458)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:437)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
This exception only comes if the statistics exist for the Hive tables being used. Has anybody already seen something like this? Any assistance would be greatly appreciated! Best, Michael -- --- Takeshi Yamamuro

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-16 Thread Takeshi Yamamuro
>> new String(line.getBytes, 0, line.getLength, parser.options.charset) // <- the charset option is used here
>> val shouldDropHeader = parser.options.headerFlag && file.start == 0
>> UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
It seems like a bug. Is there anyone who has had the same problem before? Best wishes, Han-Cheol -- Han-Cheol Cho, Ph.D., Data scientist, Data Science Team, Data Laboratory, NHN Techorus Corp. Homepage: https://sites.google.com/site/priancho/ -- --- Takeshi Yamamuro
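For reference, the option combination under discussion; per the thread, the multiLine code path in Spark 2.2 ignored the charset, so this shows how the options are meant to be used once fixed (path and encoding are illustrative):

    val df = spark.read
      .option("multiLine", "true")
      .option("encoding", "EUC-KR") // the option reported as dropped by the multiLine path
      .csv("/path/data.csv")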

Re: custom column types for JDBC datasource writer

2017-07-05 Thread Takeshi Yamamuro
... https://stackoverflow.com/questions/44927764/spark-jdbc-oracle-long-string-fields Regards, Georg -- --- Takeshi Yamamuro
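Since Spark 2.2 the JDBC writer accepts a createTableColumnTypes option to override the default column types used on CREATE TABLE, which addresses the linked question. A minimal sketch (URL, table name, and types are illustrative):

    import java.util.Properties

    df.write
      .option("createTableColumnTypes", "name VARCHAR(1024), comment VARCHAR(2048)")
      .jdbc("jdbc:oracle:thin:@//host:1521/service", "my_table", new Properties())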

Re: UDF percentile_approx

2017-06-14 Thread Takeshi Yamamuro
0234514)") >> df.agg(e).show() >> >> and exception is >> >> org.apache.spark.sql.AnalysisException: Undefined function: >> 'percentile_approx'. This function is neither a registered temporary >> function nor a permanent function registered >> >> I've also tryid with callUDF >> >> Regards. >> >> -- >> Ing. Ivaldi Andres >> > > -- --- Takeshi Yamamuro

Re: [CSV] If the number of columns of one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-08 Thread Takeshi Yamamuro
" >>> >>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote: >>> >>> Hi Takeshi, Jörn Franke, >>> >>> The problem is even I increase the maxColumns it still have some lines >>> have larger columns than the one I s

Re: [CSV] If the number of columns of one row is bigger than maxColumns, it stops the whole parsing process.

2017-06-07 Thread Takeshi Yamamuro
... to parse the next valid one? Are there any libs that can replace univocity in that job? Thanks & regards, Chanh -- --- Takeshi Yamamuro
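The two CSV options involved, for reference; note that per the thread, a row wider than maxColumns may still abort the underlying univocity parser rather than being dropped (path and cap are illustrative):

    val df = spark.read
      .option("maxColumns", "40000")   // univocity's column cap, default 20480
      .option("mode", "DROPMALFORMED") // skip malformed rows instead of failing
      .csv("/path/data.csv")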

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-20 Thread Takeshi Yamamuro
... SQL Programming Guide, and Google was not helpful. -- Daniel Siegmann, Senior Software Engineer, SecurityScorecard Inc. -- Best Regards, Ayan Guha -- --- Takeshi Yamamuro
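The coalescing being asked about is driven by two session confs that control how file splits are packed into read partitions; a sketch with their default values:

    // Each read partition receives splits up to maxPartitionBytes; small files
    // are padded by openCostInBytes so thousands of tiny files still coalesce.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
    spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)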

Re: pyspark.sql.DataFrame write error to Postgres DB

2017-04-20 Thread Takeshi Yamamuro
... File "/home/hadoop/hdtmp/nm-local-dir/usercache/hadoop/appcache/application_1491889279272_0040/container_1491889279272_0040_01_03/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "/home/hadoop/hdtmp/nm-local-dir/usercache/hadoop/appcache/application_1491889279272_0040/container_1491889279272_0040_01_03/pyspark.zip/pyspark/worker.py", line 92, in <lambda>
    mapper = lambda a: udf(*a)
  File "/home/hadoop/hdtmp/nm-local-dir/usercache/hadoop/appcache/application_1491889279272_0040/container_1491889279272_0040_01_03/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<string>", line 3, in <lambda>
TypeError: sequence item 0: expected string, NoneType found -- --- Takeshi Yamamuro

Re: From C* to DataFrames with JSON

2017-02-11 Thread Takeshi Yamamuro
t;, "b": "bar" } | > > > to Spark DataFrame: > > | id | a | b | > === > | 1 | 123 | xyz | > +--+--+-+ > | 2 | 3 | bar | > > > I'm using Spark 1.6 . > > Thanks > > > JF > -- --- Takeshi Yamamuro

Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Takeshi Yamamuro
... Data Analytics, National University of Ireland, Galway. Web: http://www.reza-analytics.eu/index.html On 11 February 2017 at 12:43, Takeshi Yamamuro <linguin@gmail.com> wrote: ...

Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Takeshi Yamamuro
... MSc, PhD Researcher, INSIGHT Centre for Data Analytics, National University of Ireland, Galway. Web: http://www.reza-analytics.eu/index.html -- --- Takeshi Yamamuro

Re: increasing cross join speed

2017-02-01 Thread Takeshi Yamamuro
... (orgClassName1, orgClassName2, dist)
}).toDF("orgClassName1", "orgClassName2", "dist") -- --- Takeshi Yamamuro
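The usual first lever for cross-join speed is broadcasting the smaller side so the product is computed map-side without a shuffle. A minimal sketch (crossJoin is Spark 2.1+; names are illustrative):

    import org.apache.spark.sql.functions.broadcast

    // Every executor gets a full copy of `small`, so `big` is never shuffled.
    val joined = big.crossJoin(broadcast(small))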

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Takeshi Yamamuro
> Maybe a naive question: why are you creating one DStream per shard? Shouldn't it be one DStream corresponding to the Kinesis stream? On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: >> Hi, just a guess though ...

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Takeshi Yamamuro
... interval, for this particular example, the driver prints out between 20 and 30 for the count value. I expected to see the count operation parallelized across the cluster. I think I must just be misunderstanding something fundamental! Can anyone point out where I'm going wrong? Yours in confusion, Graham -- --- Takeshi Yamamuro
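For context, the receiver-based Kinesis pattern in question: one DStream per shard, unioned before processing. A sketch against the pre-2.3 KinesisUtils API (app/stream names, region, and intervals are illustrative):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val streams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(ssc, "app", "myStream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, Seconds(2), StorageLevel.MEMORY_AND_DISK_2)
    }
    val unified = ssc.union(streams) // one logical stream for processing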

Re: spark intermediate data fills up the disk

2017-01-27 Thread Takeshi Yamamuro
...s, and my assumption is that these files had been there since the start of my streaming application. I should have checked the timestamp before doing rm -rf. Please let me know if I am wrong. Sent from my iPhone. On Jan 26, 2017, at 4:24 PM, Takeshi Yamamuro ...

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread Takeshi Yamamuro
"bal" >> driver="oracle.jdbc.OracleDriver" >> df = sqlContext.read.jdbc(url=url,table=table,properties={"user": >> user,"password":password,"driver":driver}) >> >> >> Still the issue persists. >> >>

Re: spark intermediate data fills up the disk

2017-01-26 Thread Takeshi Yamamuro
... spark.worker.cleanup.enabled = true. On Wed, Jan 25, 2017 at 11:30 AM, kant kodali <kanth...@gmail.com> wrote: >> I have a bunch of .index and .data files like that filling up my disk. I am not sure what the fix is. I am running Spark 2.0.2 in standalone mode. Thanks! -- --- Takeshi Yamamuro
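The standalone-mode cleanup settings referenced above, as they would appear in spark-defaults.conf (values shown are the defaults):

    spark.worker.cleanup.enabled    true
    spark.worker.cleanup.interval   1800     # seconds between cleanup sweeps
    spark.worker.cleanup.appDataTtl 604800   # keep app dirs for 7 days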

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread Takeshi Yamamuro
... at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
-- Best Regards, Ayan Guha -- --- Takeshi Yamamuro

Re: Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-25 Thread Takeshi Yamamuro
... at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any idea why it's happening? A possible bug in Spark? Thanks, Dzung. -- --- Takeshi Yamamuro

Re: Issue returning Map from UDAF

2017-01-25 Thread Takeshi Yamamuro
... at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Attached is the code that you can use to reproduce the error. Thanks, Ankur -- --- Takeshi Yamamuro

Re: Catalyst Expression(s) - Cleanup

2017-01-25 Thread Takeshi Yamamuro
... so internal resources can be cleaned up? I have seen that Generators are allowed to terminate(), but my Expression(s) do not need to emit 0..N rows. -- --- Takeshi Yamamuro

Re: freeing up memory occupied by processed Stream Blocks

2017-01-25 Thread Takeshi Yamamuro
... RDDs in our scenario are Strings coming from a Kinesis stream. Is there a way to explicitly purge an RDD after the last step in the M/R process, once and for all? Thanks much! On Fri, Jan 20, 2017 at 2:35 AM, Takeshi Yamamuro <linguin@gmail.com> wrote: ...

Re: printSchema showing incorrect datatype?

2017-01-24 Thread Takeshi Yamamuro
... x: org.apache.spark.sql.DataFrame = [x: string]
scala> x.as[Array[Byte]].printSchema
root
 |-- x: string (nullable = true)
scala> x.as[Array[Byte]].map(x => x).printSchema
root
 |-- value: binary (nullable = true)
Why does the first schema show string instead of binary? -- --- Takeshi Yamamuro

Re: help!!! Issue with spark-sql type cast from long to LongWritable

2017-01-24 Thread Takeshi Yamamuro
>> import java.util.*;
>> import org.apache.hadoop.hive.serde2.objectinspector.*;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
Please let me know why it causes an issue in Spark when it runs perfectly fine on Hive. -- --- Takeshi Yamamuro

Re: converting timestamp column to a java.util.Date

2017-01-23 Thread Takeshi Yamamuro
... where I have one timestamp column and a bunch of strings. I will need to convert that to something compatible with Mongo's ISODate. kr, Marco -- --- Takeshi Yamamuro
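One detail that usually settles this: Spark returns TimestampType values as java.sql.Timestamp, which already extends java.util.Date, so no real conversion is needed for Mongo's ISODate. A minimal sketch (column name is illustrative):

    val row = df.select("ts").head()
    val ts: java.sql.Timestamp = row.getTimestamp(0)
    val d: java.util.Date = ts // Timestamp is-a java.util.Date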

Re: freeing up memory occupied by processed Stream Blocks

2017-01-19 Thread Takeshi Yamamuro
... leading to an out-of-memory exception on some. Is there a way to "release" these blocks and free them up? The app is a simple m/r. I attempted rdd.unpersist(false) in the code, but that did not free up memory. Thanks much in advance! -- --- Takeshi Yamamuro
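For reference, the streaming knobs that govern automatic unpersisting of generated RDDs (spark.streaming.unpersist defaults to true already; the GC interval is illustrative):

    sparkConf.set("spark.streaming.unpersist", "true")
    sparkConf.set("spark.cleaner.periodicGC.interval", "1min") // trigger reference cleanup more often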

Re: spark 2.02 error when writing to s3

2017-01-19 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: is partitionBy of DataFrameWriter supported in 1.6.x?

2017-01-19 Thread Takeshi Yamamuro
... WARN hive.HiveContext$$anon$2: Persisting partitioned data source relation `test`.`my_test` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): hdfs://nameservice1/user/hive/warehouse/test.db/my_test ... looking at hdfs ...

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Takeshi Yamamuro
... Has anybody tested a generic UDF with an ObjectInspector implementation that successfully ran on both Hive and spark-sql? Please share the GitHub link or source code file. Thanks in advance, Sirisha -- --- Takeshi Yamamuro

Re: partition size inherited from parent: auto coalesce

2017-01-16 Thread Takeshi Yamamuro
... .filter(x => x > 4.0)
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // 4 -- --- Takeshi Yamamuro
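A selective filter keeps the parent's partition count, so the usual follow-up is an explicit coalesce; a minimal sketch continuing the thread's example:

    // 35 surviving rows spread over 4 partitions -> pack them into 1
    val compact = ngauss_rdd2.coalesce(1)
    compact.partitions.size // 1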

Re: Apache Spark example split/merge shards

2017-01-16 Thread Takeshi Yamamuro
... http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-example-split-merge-shards-tp28311.html -- --- Takeshi Yamamuro

Re: [Spark Core] Re-using dataframes with limit() produces unexpected results

2017-01-12 Thread Takeshi Yamamuro
... of limit. I do see that the intent for limit may be that no two limit paths should occur in a single DAG. What do you think? What is the correct explanation? Anton -- --- Takeshi Yamamuro

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Takeshi Yamamuro
+- Scan ExistingRDD[key#0,nested#1,nestedArray#2,nestedObjectArray#3,value#4L]
How can I make Spark use HashAggregate (like the count(*) expression) instead of SortAggregate with my UDAF? Is it intentional? Is there an issue tracking this? --- Regards, Andy -- --- Takeshi Yamamuro
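Background that answers the question: HashAggregate is only chosen when every field of the aggregation buffer is a mutable primitive type; a buffer containing strings, arrays, or maps falls back to SortAggregate. A sketch of a buffer schema that keeps hash aggregation:

    import org.apache.spark.sql.types.{DoubleType, LongType, StructType}

    // Inside a UserDefinedAggregateFunction subclass: an all-primitive
    // buffer keeps the plan on HashAggregate.
    def bufferSchema: StructType = new StructType()
      .add("sum", DoubleType)
      .add("count", LongType)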

Re: The spark hive udf can read broadcast the variables?

2016-12-18 Thread Takeshi Yamamuro
... Can a Hive UDF read broadcast variables? -- --- Takeshi Yamamuro

Re: Managed memory leak : spark-2.0.2

2016-12-08 Thread Takeshi Yamamuro
... stream data from Wikipedia available at https://ndownloader.figshare.com/files/5036392 Where could I read up more about managed memory leaks? Any pointers on what might be the issue would be highly helpful. Thanks, appu -- --- Takeshi Yamamuro

Re: How to disable write ahead logs?

2016-11-28 Thread Takeshi Yamamuro
... enabled for WAL to work with HDFS. My installation does not enable this HDFS feature, so I would like to disable WAL in Spark. Thanks, Tim -- --- Takeshi Yamamuro
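For reference, the receiver write-ahead log is opt-in, so disabling it is just leaving (or explicitly setting) the flag off:

    sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "false")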

Re: Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-25 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: spark streaming with kinesis

2016-11-20 Thread Takeshi Yamamuro
... that to one worker only, always? 2. If not, can I repartition stream data before processing? If yes, how? JavaDStream has only one repartition method, which takes a number of partitions and not a partitioner function, so it will randomly repartition the DStream data. Than...

Re: [SQL/Catalyst] Janino Generated Code Debugging

2016-11-17 Thread Takeshi Yamamuro
... breakpoint to the location that calls it and attempt to step into the code, or reference a line of the stacktrace that should take me into the code. Any idea how to properly set Janino to debug the Catalyst-generated code more directly? Best, Alek -- --- Takeshi Yamamuro
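A simpler route than attaching a debugger: dump the Janino-generated code and read it directly; a minimal sketch:

    import org.apache.spark.sql.execution.debug._

    df.debugCodegen() // prints each whole-stage codegen unit's Java source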

Re: AVRO File size when caching in-memory

2016-11-16 Thread Takeshi Yamamuro
Subject: Re: AVRO File size when caching in-memory
>>> Anyone?
>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote: I am using 2.0.1 and databricks avro library 3.0.1. I am running this on the latest AWS EMR release.
>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote: Spark version? Are you using Tungsten?
>>> On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote: Can someone please explain why this happens? When I read a 600kb AVRO file and cache this in memory (using cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried this with different file sizes, and the size in-memory is always proportionate. I thought Spark compresses when using cacheTable. -- --- Takeshi Yamamuro
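The size difference is expected: cacheTable stores a decoded in-memory columnar layout, not the Avro encoding, and its own compression is governed by these confs (shown with their defaults):

    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
    spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", 10000)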

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Takeshi Yamamuro
... a custom RDD can help to find the node for the key-to-node mapping; there is a getPreferredLocation() method. But I am not sure whether this will be persistent or can vary for some edge cases. Thanks in advance for your help and time! Regards, Manish -- --- Takeshi Yamamuro

Re: Spark SQL UDF - passing map as a UDF parameter

2016-11-15 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro
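On the subject's question: since Spark 2.2, typedLit can pass a Scala Map as a literal UDF argument, which plain lit() cannot. A minimal sketch (column names are illustrative):

    import org.apache.spark.sql.functions.{col, typedLit, udf}

    val lookup = typedLit(Map("a" -> 1, "b" -> 2))
    val resolve = udf((m: Map[String, Int], k: String) => m.getOrElse(k, -1))
    val out = df.withColumn("v", resolve(lookup, col("key")))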

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
... less than 2 seconds? Thanks! On Mon, Nov 14, 2016 at 7:36 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: >> Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not enough for your use case? ...

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
... stream? On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: >> Hi, the time interval can be controlled by `IdleTimeBetweenReadsInMillis` in KinesisClientLibConfiguration; however, it is not configu...

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
... at which the receiver fetched data from Kinesis. Does that mean the stream batch interval cannot be less than spark.streaming.blockInterval, and should this be configurable? Also, is there any minimum value for the streaming batch interval? Thanks -- --- Takeshi Yamamuro
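For reference, receiver-based streams (Kinesis included) cut blocks every spark.streaming.blockInterval, which is independent of, and must be no larger than, the batch interval; Spark warns when it is set below roughly 50 ms:

    sparkConf.set("spark.streaming.blockInterval", "200ms") // the default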

Re: Convert SparseVector column to Densevector column

2016-11-13 Thread Takeshi Yamamuro
// maropu. On Mon, Nov 14, 2016 at 1:20 PM, janardhan shetty <janardhan...@gmail.com> wrote: > Hi, is there any easy way of converting a dataframe column from SparseVector to DenseVector using the org.apache.spark.ml.linalg.DenseVector API? Spark ML 2.0 -- --- Takeshi Yamamuro
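A one-line UDF does the conversion with the ml.linalg API; a minimal sketch (column name is illustrative):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    val toDense = udf((v: Vector) => v.toDense)
    val out = df.withColumn("dense", toDense(col("features")))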

Re: spark streaming with kinesis

2016-11-07 Thread Takeshi Yamamuro
... Kafka in Kinesis spark streaming? Is there any limitation on the checkpoint interval (a minimum of 1 second) in spark streaming with Kinesis, while there is no such limit on the KCL side? Thanks. On Tue, Oct 25, 2016 at 8:36 AM, Takeshi Yamam...

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
... checkpoint the sequence no using some api. On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro <linguin@gmail.com> wrote: >> Hi, the only thing you can do for Kinesis checkpoints is tune their interval. https://github.com/apach...

Re: Get size of intermediate results

2016-10-24 Thread Takeshi Yamamuro
... the devel environment and I can compile Spark. It was really awesome how smoothly the setup went :) Thanks for that. Servus, Andy -- Sincerely yours, Egor Pakhomov -- --- Takeshi Yamamuro

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
... to checkpoint the sequence numbers ourselves in Kinesis, as it is in the Kafka low-level consumer? Thanks -- --- Takeshi Yamamuro

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Takeshi Yamamuro
... Spark 2.0, which is shipped with a Hadoop dependency of 2.7.2, and we use this setting. We've sort of "verified" it's used by configuring logging of the file output committer. On 30 September 2016 at 03:12, Takeshi Yamamuro <linguin@gmail.com> wrote: ...

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-09-29 Thread Takeshi Yamamuro
... http://apache-spark-user-list.1001560.n3.nabble.com/S3-DirectParquetOutputCommitter-PartitionBy-SaveMode-Append-tp26398p27810.html -- --- Takeshi Yamamuro
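The settings usually paired with direct-style S3 commits on Hadoop 2.7.x (DirectParquetOutputCommitter itself was removed in Spark 2.0), in spark-defaults.conf form:

    spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    spark.speculation false   # speculative tasks are unsafe with direct commits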

Re: Broadcast big dataset

2016-09-28 Thread Takeshi Yamamuro
Any advice is appreciated. Thank you! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127.html -- --- Takeshi Yamamuro

Re: Tuning Spark memory

2016-09-23 Thread Takeshi Yamamuro
... the sort and storage on HDFS? Thanks. -- --- Takeshi Yamamuro

Re: Spark output data to S3 is very slow

2016-09-16 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: JDBC Very Slow

2016-09-16 Thread Takeshi Yamamuro
... or keeps timing out. The code is simple:
>> val jdbcDF = sqlContext.read.format("jdbc").options(
>>   Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>>       "dbtable" -> "schema.table")).load()
>> jdbcDF.show
If anyone can help, please let me know. Thanks, Ben -- --- Takeshi Yamamuro
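A single-partition JDBC read pulls the whole table through one task, which often looks like a hang; partitioning the read is the standard fix. A sketch with illustrative bounds:

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://dbserver:port/database",
      "dbtable"         -> "schema.table",
      "partitionColumn" -> "id",     // numeric column to split on
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "8",
      "fetchsize"       -> "10000")).load()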

Re: Spark SQL Thriftserver

2016-09-13 Thread Takeshi Yamamuro
... engine. You will see this in the hive.log file. So I don't think it is going to give you much difference, unless they have recently changed the design of STS. HTH ...

Re: Debugging a Spark application in a non-lazy mode

2016-09-12 Thread Takeshi Yamamuro
... wrote: > Hi, not sure what you mean; can you give an example? Hagai. From: Takeshi Yamamuro <linguin@gmail.com> Date: Monday, September 12, 2016 at 7:24 PM To: Hagai Attias <hatt...@akamai.com> Cc: "user@spark.ap...

Re: Debugging a Spark application in a non-lazy mode

2016-09-12 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro
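The usual trick for non-lazy debugging is to force materialization after each suspect transformation so failures surface at the line that caused them; a minimal sketch (names are illustrative):

    val step1 = input.filter(_.nonEmpty).cache()
    step1.count() // action: errors in the filter surface here, not later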

Re: Spark_JDBC_Partitions

2016-09-10 Thread Takeshi Yamamuro
>> partitions? Can we set any fake query to orchestrate this pull process, as we do in SQOOP like this: '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'? Any pointers are appreciated. Thanks for your time. ~ Ajay -- --- Takeshi Yamamuro
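When no numeric split column exists, the jdbc() overload that takes explicit predicates mirrors the SQOOP boundary-query trick from the thread; a sketch hashing an illustrative key column on the Oracle side:

    import java.util.Properties

    val predicates = (0 until 12).map(i => s"MOD(ORA_HASH(id), 12) = $i").toArray
    val df = spark.read.jdbc(url, "schema.table", predicates, new Properties())
    // 12 partitions, one per predicate, with no lower/upper bound needed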

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2016-09-10 Thread Takeshi Yamamuro
... at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Env info: Spark on YARN (cluster).
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.6.0" % "provided"
THANKS -- cente...@gmail.com -- --- Takeshi Yamamuro

Re: Any estimate for a Spark 2.0.1 release date?

2016-09-06 Thread Takeshi Yamamuro
... release-date-tp27659.html -- --- Takeshi Yamamuro

Re: Any estimate for a Spark 2.0.1 release date?

2016-09-05 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: broadcast fails on join

2016-08-30 Thread Takeshi Yamamuro
... View this message in context: broadcast fails on join <http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-fails-on-join-tp27623.html> -- --- Takeshi Yamamuro
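Broadcast joins are bounded by a driver-side threshold, so a failing broadcast usually means the build side exceeded it; the two common moves:

    // Raise the limit (in bytes) if the table really fits on each executor...
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
    // ...or set it to -1 to disable automatic broadcasting entirely.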

Re: Caching broadcasted DataFrames?

2016-08-25 Thread Takeshi Yamamuro
> a copy of the dataset (all partitions) inside its own memory. Since the dataset for d1 is used in two separate joins, should I also persist it to prevent reading it from disk again? Or would broadcasting the data already take care of that? Thank you, Jestin -- --- Takeshi Yamamuro

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
Afaik, no. // maropu. On Thu, Aug 25, 2016 at 9:16 PM, Tal Grynbaum <tal.grynb...@gmail.com> wrote: > Is/was there an option similar to DirectParquetOutputCommitter to write JSON files to S3? On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro <linguin@gmail...

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
... wrote: >> Hi, when Spark saves anything to S3, it creates temporary files. Why? Asking this because it requires the access credentials to be given delete permissions along with write permissions. -- --- Takeshi Yamamuro

Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [Or(EqualTo(id,94),EqualTo(id,2))]
Filters are pushed down, so I cannot see why it is per...

Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
... avg) from v_points d where id in (90,2) group by id; the query is fast again. How can I get the execution plan of the query? And also, how can I kill long-running submitted tasks? Thanks all! -- --- Takeshi Yamamuro
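The two asks at the end have direct APIs: explain() for the plan, and job groups for cancellation; a minimal sketch:

    df.explain(true) // parsed, analyzed, optimized, and physical plans

    spark.sparkContext.setJobGroup("slow-query", "ad-hoc aggregation")
    // ... run the query ...
    spark.sparkContext.cancelJobGroup("slow-query") // cancels its running jobs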

Re: SparkSession for RDBMS

2016-08-03 Thread Takeshi Yamamuro
... partitions, upper and lower boundary if we are not specifying anything. -- Selvam Raman -- --- Takeshi Yamamuro

Re: Sqoop On Spark

2016-08-01 Thread Takeshi Yamamuro
... execution engine in Sqoop2. I see the patch (SQOOP-1532 <https://issues.apache.org/jira/browse/SQOOP-1532>), but it shows in progress. So can we not use Sqoop on Spark? Please help me if you have any idea. -- Selvam Raman -- --- Takeshi Yamamuro

Re: Possible to push sub-queries down into the DataSource impl?

2016-07-28 Thread Takeshi Yamamuro
... wondering if Spark has the hooks to allow me to try ;-) Cheers, Tim -- Ing. Marco Colombo -- --- Takeshi Yamamuro

Re: Setting spark.sql.shuffle.partitions Dynamically

2016-07-27 Thread Takeshi Yamamuro
... of a dataframe. I only know the in-memory size of the dataframe halfway through the Spark job, so I would need to stop the context and recreate it in order to set this config. Is there any better way to set this? How does spark.sql.shuffle.partitions work differently from .repartition? Brandon -- --- Takeshi Yamamuro
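spark.sql.shuffle.partitions is a runtime conf read per query, so it can be changed between jobs without recreating the context, unlike repartition(), which bakes a fixed exchange into one plan; a minimal sketch:

    spark.conf.set("spark.sql.shuffle.partitions", 400)
    val counts = df.groupBy("key").count() // this shuffle now uses 400 partitions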

Re: read parquetfile in spark-sql error

2016-07-25 Thread Takeshi Yamamuro
... CliDriver.processFile(CliDriver.java:425)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Error in query: cannot recognize input near 'parquetTable' 'USING' 'org' in table name; line 2 pos 0
Am I using it in the wrong way? Thanks -- --- Takeshi Yamamuro

Re: Bzip2 to Parquet format

2016-07-25 Thread Takeshi Yamamuro
... org.apache.spark.sql.SQLContext. On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhan...@gmail.com> wrote: We have data in Bzip2 compression format. Any links in Spark to convert it into Parquet, and also performance benchmarks and case-study materials? -- --- Takeshi Yamamuro
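Spark reads .bz2 transparently through the Hadoop codecs, so the conversion is just a read plus a Parquet write; a sketch with placeholder paths:

    val df = spark.read.json("s3://bucket/input/*.json.bz2")
    df.write.parquet("s3://bucket/output/parquet/")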

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Takeshi Yamamuro
... -- --- Takeshi Yamamuro

Re: Spark cluster tuning recommendation

2016-07-12 Thread Takeshi Yamamuro
... Status: ALIVE. Each worker has 8 cores and 4GB memory. My question is: how do people running in production decide these properties?
1) --num-executors
2) --executor-cores
3) --executor-memory
4) number of partitions
5) spark.default.parallelism
Thanks, Kartik -- --- Takeshi Yamamuro
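A starting point for the 8-core/4GB workers described above; every number is illustrative and needs workload-specific tuning:

    spark-submit \
      --num-executors 4 \
      --executor-cores 4 \
      --executor-memory 3g \
      --conf spark.default.parallelism=32 \
      my_app.jar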

Re: Spark crashes with two parquet files

2016-07-10 Thread Takeshi Yamamuro
... at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
    at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.checkE...

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Takeshi Yamamuro
... SparkContext was created at: ... Does that mean I need to allow multiple contexts? Because it's only a test in local mode; if I deploy on a Mesos cluster, what would happen? I need some suggested solutions for this. Thanks, Chanh -- Best Regards, Ayan Guha -- --- Takeshi Yamamuro

Re: Spark crashes with two parquet files

2016-07-10 Thread Takeshi Yamamuro
... code works:
path = '/data/train_parquet/0_0_0.parquet'
train0_df = sqlContext.read.load(path)
train_df.take(1)
Thanks in advance. Samir -- --- Takeshi Yamamuro

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Takeshi Yamamuro
... GROUP BY code").show()
Output:
+---+----+
|_c0|code|
+---+----+
| 18|  AS|
| 16|    |
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---+----+
I am expecting the below one. Any idea how to apply IS NOT NULL?
+---+----+
|_c0|code|
+---+----+
| 18|  AS|
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---+----+
Thanks & Regards, Radha krishna -- --- Takeshi Yamamuro
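A common catch here: in file-backed data the blank code value is an empty string, not NULL, so IS NOT NULL keeps it; filtering both cases gives the expected output. A sketch assuming the thread's table and column names:

    val out = spark.sql(
      """SELECT count(*) AS _c0, code FROM t
        |WHERE code IS NOT NULL AND trim(code) != ''
        |GROUP BY code""".stripMargin)
    out.show()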

Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
... On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: >> The join selection can be described in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92...
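Following the join-selection order in the linked SparkStrategies code: shuffle hash join is only considered once broadcast is ruled out and sort-merge is not preferred, so these two settings steer the planner toward it (the second is an internal conf; names per Spark 2.x):

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   // rule out broadcast
    spark.conf.set("spark.sql.join.preferSortMergeJoin", false)  // allow hash join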
