Hey,
Recently we found in our cluster that when we kill a Spark Streaming
app, the whole cluster cannot respond for 10 minutes.
We investigated the master node and found that the master process
consumes 100% CPU while the Spark Streaming app is being killed.
How could this happen? Did any
Hi,
In our project, we use a standalone dual-master setup plus ZooKeeper to
provide HA for the Spark master.
Now the problem is: how do we know which master is the currently alive
master?
We tried to read the information that the master stores in ZooKeeper, but we
found there is no information to
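One workaround (a sketch, not something confirmed in this thread): each standalone master's web UI exposes a JSON endpoint at http://<host>:8080/json whose "status" field reports ALIVE or STANDBY, so polling both masters identifies the active one. The host names and port below are assumptions; adjust to your deployment.

```scala
// Sketch: poll each master's web UI JSON endpoint and report its status.
// Hosts below are hypothetical; 8080 is the default master web UI port.
import scala.io.Source
import scala.util.Try

object FindAliveMaster {
  val masters = Seq("master1.example.com", "master2.example.com") // placeholders

  def status(host: String): Option[String] =
    Try(Source.fromURL(s"http://$host:8080/json").mkString).toOption
      .flatMap { json =>
        // crude extraction of the "status" field, e.g. "status" : "ALIVE"
        """"status"\s*:\s*"(\w+)"""".r.findFirstMatchIn(json).map(_.group(1))
      }

  def main(args: Array[String]): Unit =
    masters.foreach(h => println(s"$h -> ${status(h).getOrElse("unreachable")}"))
}
```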
>
>
> On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S. wrote:
>
>> Which context are you using, HiveContext or SQLContext? Can you try with
>> HiveContext?
>>
>>
>> Devan M.S. | Research Associate | Cyber
Hi, I'm using Spark 1.2
On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan
wrote:
> Hi Xuelin,
>
>
>
> What version of Spark are you using?
>
>
>
> Thanks,
>
> Daoyuan
>
>
>
> *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com]
> *Sent:* Tues
Hi,
I'm trying to migrate some Hive scripts to Spark SQL. However, I
found that some statements are incompatible with Spark SQL.
Here is my SQL; the same statement works fine in the Hive environment.
SELECT
*if(ad_user_id>1000, 1000, ad_user_id) as user_id*
FROM
ad_search_keywor
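Two possible workarounds, sketched under the assumption that the failure comes from Hive's if() UDF not being available in a plain SQLContext in Spark 1.2 (the table name is truncated in the thread, so "my_table" below is a placeholder):

```scala
// Sketch: HiveContext supports Hive UDFs such as if().
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM my_table")

// Or a portable rewrite that a plain SQLContext should also accept:
sqlContext.sql(
  """SELECT CASE WHEN ad_user_id > 1000 THEN 1000 ELSE ad_user_id END AS user_id
    |FROM my_table""".stripMargin)
```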
ustin
>
> On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao
> wrote:
>
>>
>>
>> Hi,
>>
>> I'd like to create a transform function that converts an RDD[String] to
>> an RDD[Int].
>>
>> Occasionally, the input RDD could be an empty RDD. I ju
Hi,
I'd like to create a transform function that converts an RDD[String] to
an RDD[Int].
Occasionally, the input RDD could be an empty RDD. I just want to
directly create an empty RDD[Int] if the input RDD is empty, and I don't
want to return None as the result.
Is there an easy way to do
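One simple approach, sketched here as an assumption rather than the thread's accepted answer: check emptiness with take(1) (which avoids a full scan) and build the empty RDD directly.

```scala
// Sketch: return an empty RDD[Int] when the input is empty, otherwise transform.
// take(1).isEmpty only inspects the first partition(s), not the whole RDD.
def toInts(sc: org.apache.spark.SparkContext,
           input: org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[Int] =
  if (input.take(1).isEmpty) sc.parallelize(Seq.empty[Int])
  else input.map(_.toInt)
```

sc.emptyRDD[Int] is an alternative to parallelize(Seq.empty[Int]) if your Spark version provides it.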
ng etc.).
>
> Why not increase the tasks per core?
>
> Best regards
> Le 9 janv. 2015 06:46, "Xuelin Cao" a écrit :
>
>
>> Hi,
>>
>> I'm wondering whether it is a good idea to overcommit CPU cores on
>> the Spark cluster.
>>
>>
Hi,
I'm wondering whether it is a good idea to overcommit CPU cores on
the Spark cluster.
For example, in our testing cluster, each worker machine has 24
physical CPU cores. However, we are allowed to set the CPU core number to
48 or more in the Spark configuration file. As a result,
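For reference, the knob in question in standalone mode is SPARK_WORKER_CORES, which is only the number of cores the worker advertises to the master; nothing stops it from exceeding the physical core count. A sketch of the relevant spark-env.sh line:

```bash
# spark-env.sh on each worker (standalone mode) -- a sketch.
# SPARK_WORKER_CORES is the advertised core count, not a hard limit,
# so 48 "cores" on a 24-core box simply overcommits the CPUs 2x.
SPARK_WORKER_CORES=48
```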
)
>
> The input data of the first statement is 292KB, the second is 49.1KB.
>
> The JSON file I used is examples/src/main/resources/people.json, I copied
> its contents multiple times to generate a larger file.
>
> Cheng
>
> On 1/8/15 7:43 PM, Xuelin Cao wrote:
>
>
&
the input data for each task is also 1212.5MB
On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian wrote:
> Hey Xuelin, which data item in the Web UI did you check?
>
>
> On 1/7/15 5:37 PM, Xuelin Cao wrote:
>
>
> Hi,
>
>Curious and curious. I'm puzzl
not.
>
> Tim
>
> On Wed, Jan 7, 2015 at 11:19 PM, Xuelin Cao
> wrote:
>
>>
>> Hi,
>>
>> Thanks for the information.
>>
> One more thing I want to clarify: when does Mesos or YARN allocate
> and release resources? I.e.,
on, I think it's important to see
> what other applications you want to be running besides Spark in the same
> cluster and also your use cases, to see what resource management fits your
> need.
>
> Tim
>
>
> On Wed, Jan 7, 2015 at 10:55 PM, Xuelin Cao
> wrote:
>
>
Hi,
Currently, we are building a mid-scale Spark cluster (100 nodes)
in our company. One thing bothering us is how Spark manages
resources (CPU, memory).
I know there are three resource management modes: standalone, Mesos, and YARN.
In the standalone mode, the cluster maste
null pointers
>> when there are full row groups that are null.
>>
>> https://issues.apache.org/jira/browse/SPARK-4258
>>
>> You can turn it on if you want:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
>>
>> Daniel
>>
>>
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only the
columns referenced in my SQL should be scanned.
However, in my test, I always see the whole table being scanned even though
I only "select" one column i
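For context, a minimal sketch of the cached-table setup being discussed (the table name is a placeholder, and note the first query after cacheTable still scans the full source to populate the in-memory columnar store; column pruning only applies to subsequent reads of the cache):

```scala
// Sketch: cache a registered table in the in-memory columnar store,
// then query a single column. Assumes a table "logs" is already registered.
sqlContext.cacheTable("logs")
sqlContext.sql("SELECT budget FROM logs").count() // first run: full scan, builds the cache
sqlContext.sql("SELECT budget FROM logs").count() // later runs: should touch only the cached column batches
```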
Hi,
I'm testing the Parquet file format, and predicate pushdown is a very
useful feature for us.
However, it looks like predicate pushdown doesn't work even after I set
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true"). Here
is my sql: sqlContext.sql("
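A sketch of one way to set the flag programmatically before the Parquet file is read (path and column name are placeholders; in Spark 1.2 the flag was off by default because of SPARK-4258, as noted earlier in this digest):

```scala
// Sketch: enable Parquet predicate pushdown via setConf, then query.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val data = sqlContext.parquetFile("/path/to/data.parquet") // placeholder path
data.registerTempTable("t")
sqlContext.sql("SELECT * FROM t WHERE some_col > 10") // filter may now be pushed into Parquet
```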
ioned table support? That would only scan data where
the predicate matches the partition. Depending on the cardinality of the
customerId column that could be a good option for you.
On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao wrote:
Hi,
In Spark SQL help document, it says "Some of thes
Hi,
In Spark SQL help document, it says "Some of these (such as indexes) are
less important due to Spark SQL’s in-memory computational model. Others are
slotted for future releases of Spark SQL.
- Block level bitmap indexes and virtual columns (used to build indexes)"
For our
Hi,
I tried to create a function to convert a Unix timestamp to the
hour number within a day.
It works if the code is like this:
sqlContext.registerFunction("toHour", (x: Long) => { new java.util.Date(x * 1000).getHours })
But if I do it like this, it doesn't work:
def toHour
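A likely cause, offered as an assumption since the failing snippet is truncated: registerFunction expects a function value, so a method defined with def must be eta-expanded with a trailing underscore.

```scala
// Sketch: registering a method as a UDF needs eta-expansion (toHour _).
def toHour(x: Long): Int = new java.util.Date(x * 1000).getHours
sqlContext.registerFunction("toHour", toHour _)
// equivalently: sqlContext.registerFunction("toHour", (x: Long) => toHour(x))
```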
Hi,
I'm wondering whether there is an efficient way to continuously append
new data to a registered Spark SQL table.
This is what I want: I want to build an ad-hoc query service over a
JSON-formatted system log. Naturally, the system log is continuously generated.
I will use spark
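One naive approach, sketched under the assumption of the Spark 1.2-era SchemaRDD API (paths are placeholders, and the batches must share a schema): union each new batch into the table and re-register it.

```scala
// Sketch: periodically union new log batches into the registered table.
var logs = sqlContext.jsonFile("/logs/initial.json") // placeholder path
logs.registerTempTable("logs")

def append(newBatchPath: String): Unit = {
  val batch = sqlContext.jsonFile(newBatchPath)
  logs = logs.unionAll(batch)
  logs.registerTempTable("logs") // re-register so queries see the appended data
}
```

Note the union lineage grows with every batch, so this does not scale indefinitely without periodic materialization.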
Hi,
I'm generating a Spark SQL table from an offline JSON file.
The difficulty is that the original JSON file has a hierarchical
structure. As a result, this is what I get:
scala> tb.printSchema
root
 |-- budget: double (nullable = true)
 |-- filterIp: array (nullable = true)
 |
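For what it's worth, nested and array fields can usually be queried directly without flattening the schema; a sketch using the field names from the printed schema (the table name is arbitrary):

```scala
// Sketch: query nested/array fields of the JSON-derived table directly.
tb.registerTempTable("tb")
sqlContext.sql("SELECT budget FROM tb")      // top-level field
sqlContext.sql("SELECT filterIp[0] FROM tb") // first element of the array column
```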
Hi,
I'd like to perform an operation on an RDD that ONLY changes the values of
some items, without making a full copy or a full scan of the data.
This is useful when I need to handle a large RDD where each time I only
need to change a small fraction of the data, keeping the rest unchanged.
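RDDs are immutable, so true in-place edits aren't possible; the closest cheap approximation, sketched here, is a map that returns the original object for untouched items, which avoids copying unchanged elements (though the scan itself cannot be avoided without a partition-pruning scheme):

```scala
// Sketch: rewrite only the items matching pred; unchanged items are passed
// through by reference, so no per-element copy is made for them.
def patch[T](rdd: org.apache.spark.rdd.RDD[T])(pred: T => Boolean, f: T => T)
            (implicit ct: scala.reflect.ClassTag[T]): org.apache.spark.rdd.RDD[T] =
  rdd.map(x => if (pred(x)) f(x) else x) // still a full scan, but minimal copying
```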
Hi,
I'm going to debug some Spark applications on our testing platform, and it
would be helpful if we could see the eventLog on the worker node.
I've tried to turn on spark.eventLog.enabled and set the spark.eventLog.dir
parameter on the worker node. However, it doesn't work.
I do ha
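A plausible explanation, offered as an assumption: the event log is written by the driver, not by workers, so these settings only take effect where the application is launched. A sketch of the usual configuration (the directory is a placeholder):

```bash
# spark-defaults.conf on the machine that launches the application -- a sketch.
# The event log is written by the driver, so setting these on worker
# nodes has no effect; configure them on the submitting side instead.
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-events
```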