[DataCleaner-notify] Re: [datacleaner.org] Runnin DataCleaner on Spark as a local file

Henrique Wed, 31 Aug 2016 03:03:47 -0700

New reply on DataCleaner's online discussion forum 
(https://datacleaner.org/forum):


Henrique replied to subject 'Runnin DataCleaner on Spark as a local file'

-------------------

I just saw several references to YARN in the code. Is it supposed to run only 
in YARN mode?

/Users/henriqueandrade/Documents/App/spark/spark-1.6.2-bin-hadoop2.6/bin/spark-submit \
--class org.datacleaner.spark.Main \
--master local[1] \
DataCleaner-env-spark-5.1.3-SNAPSHOT-jar-with-dependencies.jar \
conf_local.xml \
vanilla-job.analysis.xml \
jobAbsolutePath.properties

jobAbsolutePath.properties
datacleaner.result.hdfs.path=s3n://exceed-ingestion/results/myresult.analysis.result.dat

conf_local.xml
<configuration xmlns="http://eobjects.org/analyzerbeans/configuration/1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

        <datastore-catalog>
                <csv-datastore name="person_names">
                        <filename>file:///Users/henriqueandrade/Documents/App/spark/DataCleaner/engine/env/spark/target/person_names.txt</filename>
                        <quote-char>"</quote-char>
                        <separator-char>,</separator-char>
                        <escape-char>\</escape-char>
                        <encoding>UTF-8</encoding>
                        <fail-on-inconsistencies>true</fail-on-inconsistencies>
                        <multiline-values>false</multiline-values>
                        <header-line-number>1</header-line-number>
                </csv-datastore>
                <json-datastore name="person_data">
                        <filename>./person_data.json</filename>
                </json-datastore>
        </datastore-catalog>

</configuration>

vanilla-job.analysis.xml
<?xml version="1.0" encoding="UTF-8"?>
<job xmlns="http://eobjects.org/analyzerbeans/job/1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

        <source>
                <data-context ref="person_names" />
                <columns>
                        <column id="col_id" path="id" />
                        <column id="col_name" path="name" />
                        <column id="col_company" path="company" />
                        <column id="col_country" path="country" />
                </columns>
        </source>

        <analysis>
                <analyzer>
                        <descriptor ref="String analyzer" />
                        <input ref="col_company" />
                        <properties/>
                </analyzer>

                <analyzer>
                        <descriptor ref="Value distribution" />
                        <input ref="col_country" />
                        <properties>
                                <property name="Record unique values" value="true"/>
                                <property name="Record drill-down information" value="true"/>
                                <property name="Top n most frequent values" value="&lt;null&gt;"/>
                                <property name="Bottom n most frequent values" value="&lt;null&gt;"/>
                        </properties>
                </analyzer>
        </analysis>
</job>


-------------------

View the topic online to reply - go to 
https://datacleaner.org/topic/1140/Runnin-DataCleaner-on-Spark-as-a-local-file
