[jira] [Comment Edited] (APEXCORE-392) Stack Overflow when launching jobs

Ilya Ganelin (JIRA) Sat, 19 Mar 2016 02:39:19 -0700

    [ 
https://issues.apache.org/jira/browse/APEXCORE-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200660#comment-15200660
 ]


Ilya Ganelin edited comment on APEXCORE-392 at 3/18/16 12:13 AM:
-----------------------------------------------------------------

I've updated the gist with a more complete example that reliably causes this 
failure. It is not a complete project, because other components of the 
project include code that I cannot share. 

An example that causes this to break:
{code}
// Add our operators to the DAG
NewlineFileInputOperator fileInputOperator =
    dag.addOperator("NewLines", NewlineFileInputOperator.class);
fileInputOperator.setCoreSite(conf.get("coreSite"));
fileInputOperator.setHdfsSite(conf.get("hdfsSite"));
fileInputOperator.setDirectory(conf.get("citadel.data.rawAuths"));

setPartitioner(dag, fileInputOperator,
    conf.getInt("citadel.numHdfsPartitions", 16));

String label = "HDHT";
final HdfsFileOutputOperator latenciesOutput =
    addHdfsOutputOp(dag, "LatencyOut" + label, "latencies", conf);
setPartitioner(dag, latenciesOutput,
    conf.getInt("citadel.numHdfsPartitions", 16));

dag.addStream("latencies2File" + label, fileInputOperator.output,
    latenciesOutput.input);
{code}
    <property>
        <name>citadel.numPartitions</name>
        <value>20</value>
    </property>

    <property>
        <name>citadel.numHdfsPartitions</name>
        <value>20</value>
    </property>
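
To make the interaction between the snippet and the properties explicit, here is a minimal, self-contained sketch of how the partition count resolves. It assumes conf behaves like a Hadoop-style Configuration, where getInt(name, default) returns the default only when the property is absent. The class PartitionConfigSketch, its parse and getInt helpers, and the <configuration> root element are hypothetical stand-ins for illustration; they are not part of the project or of Apex.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PartitionConfigSketch {

    // The two properties from the comment. The <configuration> root is an
    // assumption; the comment shows only the <property> elements.
    static final String XML =
        "<configuration>"
        + "<property><name>citadel.numPartitions</name><value>20</value></property>"
        + "<property><name>citadel.numHdfsPartitions</name><value>20</value></property>"
        + "</configuration>";

    /** Parses name/value pairs out of Hadoop-style property XML. */
    static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Map<String, String> props = new HashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    /** Returns the integer value for key, or fallback when the key is absent. */
    static int getInt(Map<String, String> props, String key, int fallback) {
        String raw = props.get(key);
        return raw == null ? fallback : Integer.parseInt(raw);
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = parse(XML);
        // The configured value wins over the default of 16 used in the snippet.
        System.out.println(getInt(props, "citadel.numHdfsPartitions", 16)); // prints 20
        // The default applies only when the property is missing.
        System.out.println(getInt(props, "citadel.missing", 16)); // prints 16
    }
}
```

Under that assumption, both setPartitioner calls in the snippet would receive 20 rather than the fallback of 16.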


> Stack Overflow when launching jobs
> ----------------------------------
>
>                 Key: APEXCORE-392
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-392
>             Project: Apache Apex Core
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.3.0
>            Reporter: Ilya Ganelin
>            Priority: Blocker
>
> I’m running into a very frustrating issue where certain DAG configurations 
> produce the attached error log. When this happens, my application fails to 
> launch at all. This does not seem to be a YARN issue, since it occurs even 
> with a relatively small number of partitions and little memory. 
> This issue DOES appear to be related to HDFS input/output operations since 
> the specific parameter that appears to affect things is the number of 
> physical partitions for the HDFS input/output operators.
> I’ve also attached the input and output operators in question:
> https://gist.github.com/ilganeli/7f770374113b40ffa18a
> I can get this to occur predictably by:
>   1.  Increasing the partition count on my input operator (reads from HDFS) - 
> values above 20 cause this error
>   2.  Increasing the partition count on my output operator (writes to HDFS) - 
> values above 20 cause this error
>   3.  Setting stream locality on the output operator from the default to 
> THREAD_LOCAL, NODE_LOCAL, or CONTAINER_LOCAL
> This behavior is very frustrating, as it’s preventing me from partitioning my 
> HDFS I/O appropriately and therefore from scaling to higher throughput.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
