GitHub user nischay21 opened a pull request:
https://github.com/apache/spark/pull/17239
Using map function in Spark for a huge operation
We need to calculate a distance metric such as Jaccard similarity over a huge
Dataset in Spark.
We are facing a couple of issues. Kindly give us some direction.
Issue 1.
import java.util.Arrays;
import java.util.List;

import info.debatty.java.stringsimilarity.Jaccard;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sample Dataset creation
List<Row> data = Arrays.asList(
    RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
    RowFactory.create("I wish Java could use case classes",
        "I wish C# could use case classes"),
    RowFactory.create("Logistic,regression,models,are,neat",
        "Logistic,regression,models,are,neat"));

StructType schema = new StructType(new StructField[] {
    new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});

Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

// Distance metric object creation
Jaccard jaccard = new Jaccard();

// Apply the distance metric to each element of the Dataset
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row ->
        "Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
    Encoders.STRING());

sentenceDataFrame1.show();
There are no compile-time errors, but at run time we get an exception:
org.apache.spark.SparkException: Task not serializable
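One workaround we are considering (a sketch only; we are assuming the
captured jaccard instance is what fails to serialize) is to construct the
metric inside the lambda, so that nothing non-serializable is captured from
the driver:

    Dataset<String> scored = sentenceDataFrame.map(
        (MapFunction<Row, String>) row -> {
            // Created per call on the executor, so it never has to be shipped
            Jaccard jaccard = new Jaccard();
            return "Name: " + jaccard.similarity(row.getString(0), row.getString(1));
        },
        Encoders.STRING());
    scored.show();

Is this the right direction?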
Issue 2.
Moreover, we need to find which pair has the highest score, for which we
need to declare some variables; we also need to perform other calculations,
and we are facing a lot of difficulty. Even if we declare a simple variable
like a counter within the map block, we are not able to capture the
incremented value. If we declare it outside the map block, we get many
compile-time errors.
int counter = 0;

Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> {
        System.out.println("Name: " + row.getString(1));
        // int counter = 0;
        counter++; // does not compile: a captured local must be effectively final
        System.out.println("Counter: " + counter);
        return counter + "";
    },
    Encoders.STRING());
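The directions we have found so far are sketched below; we are not sure they
are right, and the names pairCounter, names, scored, and best are ours for
illustration. The first sketch uses a LongAccumulator, which we understand is
the supported way to aggregate a counter across executors; the second finds
the highest-scoring pair with Dataset operations instead of driver-side
variables:

    import org.apache.spark.util.LongAccumulator;
    import scala.Tuple3;

    // (1) Count rows with an accumulator instead of a captured local.
    LongAccumulator pairCounter =
        spark.sparkContext().longAccumulator("pairCounter");
    Dataset<String> names = sentenceDataFrame.map(
        (MapFunction<Row, String>) row -> {
            pairCounter.add(1);
            return row.getString(1);
        },
        Encoders.STRING());
    names.count(); // force evaluation before reading the accumulator
    System.out.println("Counter: " + pairCounter.value());

    // (2) Emit (left, right, score) tuples, then sort by the score column
    // ("_3" under a tuple encoder) and take the top row.
    Dataset<Tuple3<String, String, Double>> scored = sentenceDataFrame.map(
        (MapFunction<Row, Tuple3<String, String, Double>>) row -> new Tuple3<>(
            row.getString(0), row.getString(1),
            new Jaccard().similarity(row.getString(0), row.getString(1))),
        Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.DOUBLE()));
    Tuple3<String, String, Double> best =
        scored.orderBy(scored.col("_3").desc()).first();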
Please give us direction.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17239.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17239
----
commit 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-09T01:58:44Z
[SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles
is enabled (branch 2.1)
## What changes were proposed in this pull request?
Backport #16203 to branch 2.1.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16216 from zsxwing/SPARK-18774-2.1.
commit ef5646b4c6792a96e85d1dd4bb3103ba8306949b
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-09T02:26:54Z
[SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove
pip tar.gz from distribution
## What changes were proposed in this pull request?
Fixes name of R source package so that the `cp` in release-build.sh works
correctly.
Issue discussed in
https://github.com/apache/spark/pull/16014#issuecomment-265867125
Author: Shivaram Venkataraman <[email protected]>
Closes #16221 from shivaram/fix-sparkr-release-build-name.
(cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 4ceed95b43d0cd9665004865095a40926efcc289
Author: [email protected] <[email protected]>
Date: 2016-12-09T06:08:19Z
[SPARK-18349][SPARKR] Update R API documentation on ml model summary
## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow
the format: returns summary information of the fitted model, which is a list.
The list includes .......
Since `summary` in R is mainly about the model, which is not the same as the
`summary` object on the Scala side (if there is one), the Scala API doc is
not pointed to here.
In the current document, some `return` descriptions end with `.` and some
don't; `.` is added to the ones missing it.
Since spark.logit `summary` has a big refactoring, this PR doesn't include
this one. It will be changed when the `spark.logit` PR is merged.
## How was this patch tested?
Manual build.
Author: [email protected] <[email protected]>
Closes #16150 from wangmiao1981/audit2.
(cherry picked from commit 86a96034ccb47c5bba2cd739d793240afcfc25f6)
Signed-off-by: Felix Cheung <[email protected]>
commit e8f351f9a670fc4d43f15c8d7cd57e49fb9ceba2
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-09T06:21:24Z
Copy the SparkR source package with LFTP
This PR adds a line in release-build.sh to copy the SparkR source archive
using LFTP
Author: Shivaram Venkataraman <[email protected]>
Closes #16226 from shivaram/fix-sparkr-copy-build.
(cherry picked from commit 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 2c88e1dc31e1b90605ad8ab85b20b131b4b3c722
Author: Felix Cheung <[email protected]>
Date: 2016-12-09T06:52:34Z
Copy pyspark and SparkR packages to latest release dir too
## What changes were proposed in this pull request?
Copy pyspark and SparkR packages to latest release dir, as per comment
[here](https://github.com/apache/spark/pull/16226#discussion_r91664822)
Author: Felix Cheung <[email protected]>
Closes #16227 from felixcheung/pyrftp.
(cherry picked from commit c074c96dc57bf18b28fafdcac0c768d75c642cba)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 72bf5199738c7ab0361b2b55eb4f4299048a21fa
Author: Zhan Zhang <[email protected]>
Date: 2016-12-09T08:35:06Z
[SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic
Make stateful UDFs nondeterministic.
Add new test cases with both stateful and stateless UDFs.
Without the patch, the test cases will throw exception:
1 did not equal 10
ScalaTestFailureLocation:
org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at
(HiveUDFSuite.scala:501)
org.scalatest.exceptions.TestFailedException: 1 did not equal 10
at
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
Author: Zhan Zhang <[email protected]>
Closes #16068 from zhzhan/state.
(cherry picked from commit 67587d961d5f94a8639c20cb80127c86bf79d5a8)
Signed-off-by: Wenchen Fan <[email protected]>
commit b226f10e3df8b789da6ef820b256f994b178fbbe
Author: Jacek Laskowski <[email protected]>
Date: 2016-12-09T10:45:57Z
[MINOR][CORE][SQL][DOCS] Typo fixes
## What changes were proposed in this pull request?
Typo fixes
## How was this patch tested?
Local build. Awaiting the official build.
Author: Jacek Laskowski <[email protected]>
Closes #16144 from jaceklaskowski/typo-fixes.
(cherry picked from commit b162cc0c2810c1a9fa2eee8e664ffae84f9eea11)
Signed-off-by: Sean Owen <[email protected]>
commit 0c6415aeca7a5c2fc5462c483c60d770f0236efe
Author: Xiangrui Meng <[email protected]>
Date: 2016-12-09T15:51:46Z
[SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend
## What changes were proposed in this pull request?
* This PR changes `JVMObjectTracker` from an `object` to a `class` and
associates an instance with each RBackend, so we can manage the lifecycle of
JVM objects when there are multiple `RBackend` sessions. `RBackend.close`
will clear the object tracker explicitly.
* I assume that `SQLUtils` and `RRunner` do not need to track JVM
instances, which could be wrong.
* Small refactor of `SerDe.sqlSerDe` to increase readability.
## How was this patch tested?
* Added unit tests for `JVMObjectTracker`.
* Wait for Jenkins to run full tests.
Author: Xiangrui Meng <[email protected]>
Closes #16154 from mengxr/SPARK-17822.
(cherry picked from commit fd48d80a6145ea94f03e7fc6e4d724a0fbccac58)
Signed-off-by: Xiangrui Meng <[email protected]>
commit eb2d9bfd4e100789604ca0810929b42694ea7377
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-09T18:12:56Z
[MINOR][SPARKR] Fix SparkR regex in copy command
Fix SparkR package copy regex. The existing code leads to
```
Copying release tarballs to
/home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f-bin
mput: SparkR-*: no files found
```
Author: Shivaram Venkataraman <[email protected]>
Closes #16231 from shivaram/typo-sparkr-build.
(cherry picked from commit be5fc6ef72c7eb586b184b0f42ac50ef32843208)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 562507ef038f09ff422e9831416af5119282a9d0
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-12-09T22:13:36Z
[SPARK-18745][SQL] Fix signed integer overflow due to toInt cast
## What changes were proposed in this pull request?
This PR avoids the result of a `toInt` cast being negative due to signed
integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0 ). This PR performs
the casts only after we can ensure the value is within the range of a signed
integer (the result of `max(array.length, ???)` is always an integer).
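For illustration (the values below are invented, not from the patch), the
failure mode in Java terms is that narrowing a long to int silently drops
the high 32 bits, so a large positive long can become a small or even
negative int:

    long a = 4_294_967_297L; // 0x1_0000_0001
    int castA = (int) a;     // 1: high bits dropped
    long b = 3_000_000_000L; // greater than Integer.MAX_VALUE
    int castB = (int) b;     // -1294967296: sign bit now set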
## How was this patch tested?
Manually executed query68 of TPC-DS with 100TB
Author: Kazuaki Ishizaki <[email protected]>
Closes #16235 from kiszk/SPARK-18745.
(cherry picked from commit d60ab5fd9b6af9aa5080a2d13b3589d8b79c5c5c)
Signed-off-by: Herman van Hovell <[email protected]>
commit e45345d91e333e0b5f9219e857affeda461863c6
Author: Xiangrui Meng <[email protected]>
Date: 2016-12-10T01:34:52Z
[SPARK-18812][MLLIB] explain "Spark ML"
## What changes were proposed in this pull request?
There has been some confusion around "Spark ML" vs. "MLlib". This PR adds
some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce
the confusion.
I check the [Spark FAQ page](http://spark.apache.org/faq.html), which seems
too high-level for the content here. So I added it to the MLlib user guide
instead.
cc: mateiz
Author: Xiangrui Meng <[email protected]>
Closes #16241 from mengxr/SPARK-18812.
(cherry picked from commit d2493a203e852adf63dde4e1fc993e8d11efec3d)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 8bf56cc46b96874565ebd8109f62e69e6c0cf151
Author: Felix Cheung <[email protected]>
Date: 2016-12-10T03:06:05Z
[SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods
with void return values
## What changes were proposed in this pull request?
Several SparkR API calling into JVM methods that have void return values
are getting printed out, especially when running in a REPL or IDE.
example:
```
> setLogLevel("WARN")
NULL
```
We should fix this to make the result more clear.
Also found a small change to the return value of dropTempView in 2.1 - added
doc and a test for it.
## How was this patch tested?
manually - I didn't find an expect_*() method in testthat for this
Author: Felix Cheung <[email protected]>
Closes #16237 from felixcheung/rinvis.
(cherry picked from commit 3e11d5bfef2f05bd6d42c4d6188eae6d63c963ef)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit b020ce408507d7fd57f6d357054a2b3530a5b95e
Author: Burak Yavuz <[email protected]>
Date: 2016-12-10T06:49:51Z
[SPARK-18811] StreamSource resolution should happen in stream execution
thread
## What changes were proposed in this pull request?
When you start a stream, if we are trying to resolve the source of the
stream, for example if we need to resolve partition columns, this could take a
long time. This long execution time should not block the main thread where
`query.start()` was called on. It should happen in the stream execution thread
possibly before starting any triggers.
## How was this patch tested?
Unit test added. Made sure test fails with no code changes.
Author: Burak Yavuz <[email protected]>
Closes #16238 from brkyvz/SPARK-18811.
(cherry picked from commit 63c9159870ee274c68e24360594ca01d476b9ace)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 2b36f4943051fafea0b12b662b4f4dab54806d26
Author: Huaxin Gao <[email protected]>
Date: 2016-12-10T14:41:40Z
[SPARK-17460][SQL] Make sure sizeInBytes in Statistics will not overflow
## What changes were proposed in this pull request?
1. In SparkStrategies.canBroadcast, I will add the check
plan.statistics.sizeInBytes >= 0
2. In LocalRelations.statistics, when calculating the statistics, I will
change the size to BigInt so it won't overflow.
## How was this patch tested?
I will add a test case to make sure the statistics.sizeInBytes won't
overflow.
Author: Huaxin Gao <[email protected]>
Closes #16175 from huaxingao/spark-17460.
(cherry picked from commit c5172568b59b4cf1d3dc7ed8c17a9bea2ea2ab79)
Signed-off-by: Wenchen Fan <[email protected]>
commit 83822df02fcd541068dd9cd462293f3cddfb6631
Author: Dongjoon Hyun <[email protected]>
Date: 2016-12-10T16:40:10Z
[MINOR][DOCS] Remove Apache Spark Wiki address
## What changes were proposed in this pull request?
According to the notice on the Wiki front page quoted below, we can safely
remove the obsolete wiki pointers in `README.md` and `docs/index.md`, too.
These two lines are the last occurrences of those links.
```
All current wiki content has been merged into pages at
http://spark.apache.org as of November 2016.
Each page links to the new location of its information on the Spark web
site.
Obsolete wiki content is still hosted here, but carries a notice that it is
no longer current.
```
## How was this patch tested?
Manual.
- `README.md`:
https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme
- `docs/index.md`:
```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <[email protected]>
Closes #16239 from dongjoon-hyun/remove_wiki_from_readme.
(cherry picked from commit f3a3fed76cb74ecd0f46031f337576ce60f54fb2)
Signed-off-by: Sean Owen <[email protected]>
commit 5151dafaaa6533ea88f7173c136e004ad87abd04
Author: Michal Senkyr <[email protected]>
Date: 2016-12-10T19:54:07Z
[SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building
with Java 8
## What changes were proposed in this pull request?
The API documentation build was failing when using Java 8 due to incorrect
character `>` in Javadoc.
Replace `>` with literals in Javadoc to allow the build to pass.
## How was this patch tested?
Documentation was built and inspected manually to ensure it still displays
correctly in the browser
```
cd docs && jekyll serve
```
Author: Michal Senkyr <[email protected]>
Closes #16201 from michalsenkyr/javadoc8-gt-fix.
(cherry picked from commit 114324832abce1fbb2c5f5b84a66d39dd2d4398a)
Signed-off-by: Sean Owen <[email protected]>
commit de21ca46e5d992dd950b6dcec71d7aee0cf6532e
Author: wangzhenhua <[email protected]>
Date: 2016-12-11T05:25:29Z
[SPARK-18815][SQL] Fix NPE when collecting column stats for string/binary
column having only null values
## What changes were proposed in this pull request?
During column stats collection, average and max length will be null if a
column of string/binary type has only null values. To fix this, I use default
size when avg/max length is null.
## How was this patch tested?
Add a test for handling null columns
Author: wangzhenhua <[email protected]>
Closes #16243 from wzhfy/nullStats.
(cherry picked from commit a29ee55aaadfe43ac9abb0eaf8b022b1e6d7babb)
Signed-off-by: Reynold Xin <[email protected]>
commit d4c03f8769f063b0dfac7d000513a2bc20989549
Author: Wenchen Fan <[email protected]>
Date: 2016-12-11T09:12:46Z
[SQL][MINOR] simplify a test to fix the maven tests
## What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/15620 , all of the Maven-based
2.0 Jenkins jobs time out consistently. As I pointed out in
https://github.com/apache/spark/pull/15620#discussion_r91829129 , it seems that
the regression test is an overkill and may hit constants pool size limitation,
which is a known issue and hasn't been fixed yet.
Since #15620 only fixes the code size limitation problem, we can simplify
the test to avoid hitting the constants pool size limitation.
## How was this patch tested?
test only change
Author: Wenchen Fan <[email protected]>
Closes #16244 from cloud-fan/minor.
(cherry picked from commit 9abd05b6b94eda31c47bce1f913af988c35f1cb1)
Signed-off-by: Sean Owen <[email protected]>
commit d5f14168d39433a02d065206c3910595339ff3dc
Author: krishnakalyan3 <[email protected]>
Date: 2016-12-11T09:28:16Z
[SPARK-18628][ML] Update Scala param and Python param to have quotes
## What changes were proposed in this pull request?
Updated Scala param and Python param to have quotes around the options
making it easier for users to read.
## How was this patch tested?
Manually checked the docstrings
Author: krishnakalyan3 <[email protected]>
Closes #16242 from krishnakalyan3/doc-string.
(cherry picked from commit c802ad87182520662be51eb611ea1c64f4874c4e)
Signed-off-by: Sean Owen <[email protected]>
commit 63693c17e4407ec61052553d563218787c6f0dd6
Author: Tyson Condie <[email protected]>
Date: 2016-12-12T07:38:31Z
[SPARK-18790][SS] Keep a general offset history of stream batches
## What changes were proposed in this pull request?
Instead of only keeping the minimum number of offsets around, we should
keep enough information to allow us to roll back n batches and reexecute the
stream starting from a given point. In particular, we should create a config in
SQLConf, spark.sql.streaming.retainedBatches that defaults to 100 and ensure
that we keep enough log files in the following places to roll back the
specified number of batches:
the offsets that are present in each batch
versions of the state store
the files lists stored for the FileStreamSource
the metadata log stored by the FileStreamSink
marmbrus zsxwing
## How was this patch tested?
The following tests were added.
### StreamExecution offset metadata
Test added to StreamingQuerySuite that ensures offset metadata is garbage
collected according to minBatchesRetain
### CompactibleFileStreamLog
Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged
starting before the first compaction file that precedes the current batch id -
minBatchesToRetain.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Tyson Condie <[email protected]>
Closes #16219 from tcondie/offset_hist.
(cherry picked from commit 83a42897ae90d84a54373db386a985e3e2d5903a)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 35011608f492ddcb19144954ba96c45ca6f87784
Author: Bill Chambers <[email protected]>
Date: 2016-12-12T13:33:17Z
[DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed
## What changes were proposed in this pull request?
This PR clarifies where accumulators will be displayed.
## How was this patch tested?
No testing.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Bill Chambers <[email protected]>
Author: anabranch <[email protected]>
Author: Bill Chambers <[email protected]>
Closes #16180 from anabranch/improve-acc-docs.
(cherry picked from commit 70ffff21f769b149bee787fe5901d9844a4d97b8)
Signed-off-by: Sean Owen <[email protected]>
commit 523071f3fae72909b64c7f405868bbc85f5c3cde
Author: Yuming Wang <[email protected]>
Date: 2016-12-12T22:38:36Z
[SPARK-18681][SQL] Fix filtering to compatible with partition keys of type
int
## What changes were proposed in this pull request?
Cloudera put
`/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as
the configuration file for the Hive Metastore Server, where
`hive.metastore.try.direct.sql=false`. But Spark isn't reading this
configuration file and gets the default value `hive.metastore.try.direct.sql=true`.
As mallman said, we should use the `getMetaConf` method to obtain the original
configuration from the Hive Metastore Server. I have tested this method a few
times and the return value is always consistent with the Hive Metastore Server.
## How was this patch tested?
The existing tests.
Author: Yuming Wang <[email protected]>
Closes #16122 from wangyum/SPARK-18681.
(cherry picked from commit 90abfd15f4b3f612a7b0ff65f03bf319c78a0243)
Signed-off-by: Herman van Hovell <[email protected]>
commit 1aeb7f427d31bfd44f7abb7c56dd7661be8bbaa6
Author: Felix Cheung <[email protected]>
Date: 2016-12-12T22:40:41Z
[SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots
## What changes were proposed in this pull request?
Support overriding the download url (include version directory) in an
environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`
## How was this patch tested?
unit test, manually testing
- snapshot build url
- download when spark jar not cached
- when spark jar is cached
- RC build url
- download when spark jar not cached
- when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell
To use this,
```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```
then in R,
```
library(SparkR) # or specify lib.loc
sparkR.session()
```
Author: Felix Cheung <[email protected]>
Closes #16248 from felixcheung/rinstallurl.
(cherry picked from commit 8a51cfdcad5f8397558ed2e245eb03650f37ce66)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 9dc5fa5f77d910e44746c5866cb77565c4b761d9
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-13T06:31:22Z
[SPARK-18796][SS] StreamingQueryManager should not block when starting a
query
## What changes were proposed in this pull request?
Major change in this PR:
- Add `pendingQueryNames` and `pendingQueryIds` to track queries that are
going to start but are not yet put into `activeQueries`, so that we don't
need to hold a lock when starting a query.
Minor changes:
- Fix a potential NPE when the user sets `checkpointLocation` using SQLConf
but doesn't specify a query name.
- Add missing docs in `StreamingQueryListener`
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16220 from zsxwing/SPARK-18796.
(cherry picked from commit 417e45c58484a6b984ad2ce9ba8f47aa0a9983fd)
Signed-off-by: Tathagata Das <[email protected]>
commit 9f0e3be622c77f7a677ce2c930b6dba2f652df00
Author: [email protected] <[email protected]>
Date: 2016-12-13T06:41:11Z
[SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes
## What changes were proposed in this pull request?
spark.logit was added in 2.1. We need to update sparkr-vignettes to reflect
the changes. This is part of the SparkR QA work.
## How was this patch tested?
Manual build html. Please see attached image for the result.

Author: [email protected] <[email protected]>
Closes #16222 from wangmiao1981/veg.
(cherry picked from commit 2aa16d03db79a642cbe21f387441c34fc51a8236)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 207107bca5e550657b02892eef74230787972d10
Author: Marcelo Vanzin <[email protected]>
Date: 2016-12-13T18:02:19Z
[SPARK-18835][SQL] Don't expose Guava types in the JavaTypeInference API.
This avoids issues during maven tests because of shading.
Author: Marcelo Vanzin <[email protected]>
Closes #16260 from vanzin/SPARK-18835.
(cherry picked from commit f280ccf449f62a00eb4042dfbcf7a0715850fd4c)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit d5c4a5d06b3282aec8300d27510393161773061b
Author: jerryshao <[email protected]>
Date: 2016-12-13T18:37:45Z
[SPARK-18840][YARN] Avoid throw exception when getting token renewal
interval in non HDFS security environment
## What changes were proposed in this pull request?
Fix `java.util.NoSuchElementException` when running Spark in non-hdfs
security environment.
In the current code, we assume `HDFS_DELEGATION_KIND` token will be found
in Credentials. But in some cloud environments, HDFS is not required, so we
should avoid this exception.
## How was this patch tested?
Manually verified in local environment.
Author: jerryshao <[email protected]>
Closes #16265 from jerryshao/SPARK-18840.
(cherry picked from commit 43298d157d58d5d03ffab818f8cdfc6eac783c55)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 292a37f2455b12ef8dfbdaf5b905a69b8b5e3728
Author: Alex Bozarth <[email protected]>
Date: 2016-12-13T21:37:46Z
[SPARK-18816][WEB UI] Executors Logs column only ran visibility check on
initial table load
## What changes were proposed in this pull request?
When I added a visibility check for the logs column on the executors page
in #14382 the method I used only ran the check on the initial DataTable
creation and not subsequent page loads. I moved the check out of the table
definition and instead it runs on each page load. The jQuery DataTable
functionality used is the same.
## How was this patch tested?
Tested Manually
No visible UI changes to screenshot.
Author: Alex Bozarth <[email protected]>
Closes #16256 from ajbozarth/spark18816.
(cherry picked from commit aebf44e50b6b04b848829adbbe08b0f74f31eb32)
Signed-off-by: Sean Owen <[email protected]>
commit f672bfdf9689c0ab74226b11785ada50b72cd488
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-13T22:09:25Z
[SPARK-18843][CORE] Fix timeout in awaitResultInForkJoinSafely (branch 2.1,
2.0)
## What changes were proposed in this pull request?
This PR fixes the timeout value in `awaitResultInForkJoinSafely` for 2.1
and 2.0. Master has been fixed by https://github.com/apache/spark/pull/16230.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16268 from zsxwing/SPARK-18843.
commit 25b97589e32ddc424df500059cd9962eb1b2fa6b
Author: Tathagata Das <[email protected]>
Date: 2016-12-13T22:14:25Z
[SPARK-18834][SS] Expose event time stats through StreamingQueryProgress
## What changes were proposed in this pull request?
- Changed `StreamingQueryProgress.watermark` to
`StreamingQueryProgress.queryTimestamps` which is a `Map[String, String]`
containing the following keys: "eventTime.max", "eventTime.min",
"eventTime.avg", "processingTime", "watermark". All of them UTC formatted
strings.
- Renamed `StreamingQuery.timestamp` to
`StreamingQueryProgress.triggerTimestamp` to differentiate from
`queryTimestamps`. It has the timestamp of when the trigger was started.
## How was this patch tested?
Updated tests
Author: Tathagata Das <[email protected]>
Closes #16258 from tdas/SPARK-18834.
(cherry picked from commit c68fb426d4ac05414fb402aa1f30f4c98df103ad)
Signed-off-by: Tathagata Das <[email protected]>
----