[GitHub] drill pull request #584: DRILL-4884: Fix bug that drill sometimes produced I...

2016-10-24 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/584#discussion_r84831079
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/IteratorValidatorBatchIterator.java
 ---
@@ -301,7 +301,7 @@ public IterOutcome next() {
   "Incoming batch [#%d, %s] has an empty schema. This is 
not allowed.",
   instNum, batchTypeName));
 }
-if (incoming.getRecordCount() > MAX_BATCH_SIZE) {
+if (incoming.getRecordCount() >= MAX_BATCH_SIZE) {
--- End diff --

Drill requires that a batch with no selection vector (SV), and a batch with 
an SV2, contain at most 65536 records.  This requirement holds across the 
entire Drill code base. What IteratorValidator does is check that every 
incoming batch meets this requirement, when assertions are enabled. However, 
it is each operator's responsibility to enforce it. For instance, as you saw, 
each reader in Drill should produce batches no larger than 65536. If you 
develop a new storage plugin with a new reader, then the new reader should 
enforce this rule as well.

Therefore, in your situation where you are developing a new reader, the right 
approach is to make sure the reader produces batches no larger than this 
threshold.

The original code in IteratorValidatorBatchIterator.java should be fine.  For 
the repro I tried, I feel the fix should be in LimitRecordBatch.java. As you 
indicated earlier, the index "i" is defined as char, which is not right.

Would you like to modify your patch?
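
To see in isolation why a char index is a problem at this bound, here is a 
minimal sketch. It is not the LimitRecordBatch code; the class and method 
names are made up for illustration, and the safety cap exists only so the 
demo terminates. A Java char tops out at 65535, so a char loop index can 
never reach 65536 and wraps back to 0 instead.

```java
class CharIndexWrap {
    // Batch-size bound cited in this thread.
    static final int MAX_BATCH_SIZE = 65536;

    // Counts iterations of a loop driven by a char index.
    // safetyCap (hypothetical, demo only) stops the loop if the index wraps.
    static int countWithCharIndex(int limit, int safetyCap) {
        int iterations = 0;
        for (char i = 0; i < limit && iterations < safetyCap; i++) {
            iterations++;
        }
        return iterations;
    }

    // Same loop with an int index, which can represent 65536.
    static int countWithIntIndex(int limit) {
        int iterations = 0;
        for (int i = 0; i < limit; i++) {
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println("max char value: " + (int) Character.MAX_VALUE);
        System.out.println("int index:  " + countWithIntIndex(MAX_BATCH_SIZE));
        System.out.println("char index: " + countWithCharIndex(MAX_BATCH_SIZE, 100_000));
    }
}
```

With a limit of 65536 the char version only stops because of the cap, while 
the int version stops after exactly 65536 iterations. With a limit of 65535 
or less both behave the same.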




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (DRILL-4961) Schema change error due to a missing column in a Json file

2016-10-24 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-4961:
---

 Summary: Schema change error due to a missing column in a Json file
 Key: DRILL-4961
 URL: https://issues.apache.org/jira/browse/DRILL-4961
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.8.0
Reporter: Boaz Ben-Zvi


A missing column in a batch defaults to a (hard-coded) nullable INT (e.g., see 
line 128 in ExpressionTreeMaterializer.java), which can cause a schema conflict 
when that column in another batch has a conflicting type (e.g. VARCHAR).
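
The conflict can be pictured with a small stand-alone sketch. This is not 
Drill code: the class, the method and the type strings are hypothetical 
stand-ins, and only the "missing column becomes nullable INT" rule is taken 
from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MissingColumnSketch {
    // Stand-in for the hard-coded default type described above.
    static final String DEFAULT_TYPE = "INT:OPTIONAL";

    // Builds the schema one fragment would report for the projected columns:
    // columns present in the file keep their type, absent ones get the default.
    static Map<String, String> materialize(List<String> projected,
                                           Map<String, String> fileColumns) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (String column : projected) {
            schema.put(column, fileColumns.getOrDefault(column, DEFAULT_TYPE));
        }
        return schema;
    }

    public static void main(String[] args) {
        List<String> projected = List.of("first_name", "last_name");
        Map<String, String> fullFile = Map.of(
            "first_name", "VARCHAR:OPTIONAL",
            "last_name", "VARCHAR:OPTIONAL");
        Map<String, String> trimmedFile = Map.of("first_name", "VARCHAR:OPTIONAL");

        Map<String, String> a = materialize(projected, fullFile);
        Map<String, String> b = materialize(projected, trimmedFile);
        System.out.println(a);
        System.out.println(b);
        System.out.println("schemas match: " + a.equals(b));
    }
}
```

The two fragments disagree on the type of last_name, which is the kind of 
mismatch a downstream operator merging both streams would have to reject.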

To recreate (the following test also led to DRILL-4960, which may be 
related): run a parallel aggregation over two small JSON files (e.g. two 
copies of contrib/storage-mongo/src/test/resources/emp.json) where in one of 
the files a whole column (e.g. "last_name") was removed.

0: jdbc:drill:zk=local> alter session set planner.slice_target = 1;
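+-------+--------------------------------+
|  ok   |            summary             |
+-------+--------------------------------+
| true  | planner.slice_target updated.  |
+-------+--------------------------------+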
1 row selected (0.091 seconds)
0: jdbc:drill:zk=local> select first_name, last_name from `drill/data/emp` 
group by first_name, last_name;
Error: SYSTEM ERROR: SchemaChangeException: Incoming batches for merging 
receiver have different schemas!

Fragment 1:0

[Error Id: 1315ddc5-5c31-404f-917b-c7a082d016cf on 10.250.57.63:31010] 
(state=,code=0)

The above used a streaming aggregation; when switching to hash aggregation the 
same error manifests differently:

0: jdbc:drill:zk=local> alter session set `planner.enable_streamagg` = false;
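+-------+------------------------------------+
|  ok   |              summary               |
+-------+------------------------------------+
| true  | planner.enable_streamagg updated.  |
+-------+------------------------------------+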
1 row selected (0.083 seconds)
0: jdbc:drill:zk=local> select first_name, last_name from `drill/data/emp` 
group by first_name, last_name;
Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.  
Expected vector class of org.apache.drill.exec.vector.NullableIntVector but was 
holding vector class org.apache.drill.exec.vector.NullableVarCharVector, field= 
last_name(VARCHAR:OPTIONAL)[$bits$(UINT1:REQUIRED), 
last_name(VARCHAR:OPTIONAL)[$offsets$(UINT4:REQUIRED)]] 

Fragment 2:0

[Error Id: 58d0-3bfe-4197-b4bd-44f9d7604d77 on 10.250.57.63:31010] 
(state=,code=0)
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-4954) allTextMode in the MapRDB plugin always return nulls

2016-10-24 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-4954.

Resolution: Fixed

Fixed in 
[4efc9f2|https://github.com/apache/drill/commit/4efc9f248ef7ef4b86660a1a73a9f44662c082ba]

> allTextMode in the MapRDB plugin always return nulls
> 
>
> Key: DRILL-4954
> URL: https://issues.apache.org/jira/browse/DRILL-4954
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - MapRDB
>Affects Versions: 1.8.0
> Environment: MapRDB
>Reporter: Boaz Ben-Zvi
>Assignee: Smidth Panchamia
> Fix For: 1.9.0
>
>
> Setting the "allTextMode" option to "true" in the MapR fs plugin, like:
>   "formats": {
>     "maprdb": {
>       "type": "maprdb",
>       "allTextMode": true
>     }
>   }
> makes the returned results null. Here’s an example:
> << default plugin, unchanged >>
> 0: jdbc:drill:> use mfs.tpch_sf1_maprdb_json;
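> +-------+--------------------------------------------------------+
> |  ok   |                        summary                         |
> +-------+--------------------------------------------------------+
> | true  | Default schema changed to [mfs1.tpch_sf1_maprdb_json]  |
> +-------+--------------------------------------------------------+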
> 1 row selected (0.153 seconds)
> 0: jdbc:drill:> select typeof(N_REGIONKEY) from nation limit 1;
> +---------+
> | EXPR$0  |
> +---------+
> | BIGINT  |
> +---------+
> 1 row selected (0.206 seconds)
> 0: jdbc:drill:> select N_REGIONKEY from nation limit 2;
> +--------------+
> | N_REGIONKEY  |
> +--------------+
> | 0            |
> | 2            |
> +--------------+
> 2 rows selected (0.254 seconds)
> << plugin changed to all text mode (as shown above) >>
> 0: jdbc:drill:> select typeof(N_REGIONKEY) from nation limit 1;
> +---------+
> | EXPR$0  |
> +---------+
> | NULL    |
> +---------+
> 1 row selected (0.321 seconds)
> 0: jdbc:drill:> select N_REGIONKEY from nation limit 2;
> +--------------+
> | N_REGIONKEY  |
> +--------------+
> | null         |
> | null         |
> +--------------+
> 2 rows selected (0.25 seconds)





[jira] [Resolved] (DRILL-4894) Fix unit test failure in 'storage-hive/core' module

2016-10-24 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-4894.

   Resolution: Fixed
Fix Version/s: 1.9.0

Fixed in 
[f3c26e3|https://github.com/apache/drill/commit/f3c26e34e3a72ef338c4dbca1a0204f342176972]

> Fix unit test failure in 'storage-hive/core' module
> ---
>
> Key: DRILL-4894
> URL: https://issues.apache.org/jira/browse/DRILL-4894
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Aditya Kishore
>Assignee: Aditya Kishore
> Fix For: 1.9.0
>
>
> As part of DRILL-4886, I added `hbase-server` as a dependency for 
> 'storage-hive/core', which pulled in an older version (2.5.1) of some Hadoop 
> jars, incompatible with the other Hadoop jars used by Drill (2.7.1).
> This breaks unit tests in this module.





[jira] [Resolved] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-24 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-3178.

   Resolution: Fixed
Fix Version/s: (was: Future)
   1.9.0

Fixed in 
[42948fe|https://github.com/apache/drill/commit/42948feb4a45f98f3d116d2e2a765cc3fadb5937]

> csv reader should allow newlines inside quotes 
> ---
>
> Key: DRILL-3178
> URL: https://issues.apache.org/jira/browse/DRILL-3178
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.0.0
> Environment: Ubuntu Trusty 14.04.2 LTS
>Reporter: Neal McBurnett
>Assignee: F Méthot
> Fix For: 1.9.0
>
> Attachments: drill-3178.patch
>
>
> When reading a csv file which contains newlines within quoted strings, e.g. 
> via
> select * from dfs.`/tmp/q.csv`;
> Drill 1.0 says:
> Error: SYSTEM ERROR: com.univocity.parsers.common.TextParsingException:  
> Error processing input: Cannot use newline character within quoted string
> But many tools produce csv files with newlines in quoted strings.  Drill 
> should be able to handle them.
> Workaround: the csvquote program (https://github.com/dbro/csvquote) can 
> encode embedded commas and newlines, and even decode them later if desired.





[jira] [Resolved] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-24 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-4653.

   Resolution: Fixed
Fix Version/s: (was: Future)
   1.9.0

Fixed in 
[db48298|https://github.com/apache/drill/commit/db48298920575cb1c2283e03bdfc7b50e83ae217]

> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: 1.9.0
>
>
> Currently a Drill query terminates on the first invalid JSON line it 
> encounters. Drill should skip the bad records and continue. A setting 
> along the lines of (ignore.malformed.json) would help.





[jira] [Resolved] (DRILL-4369) Database driver fails to report any major or minor version information

2016-10-24 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-4369.

Resolution: Fixed

> Database driver fails to report any major or minor version information
> --
>
> Key: DRILL-4369
> URL: https://issues.apache.org/jira/browse/DRILL-4369
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - JDBC
>Affects Versions: 1.4.0
>Reporter: N Campbell
>Assignee: Laurent Goujon
> Fix For: 1.9.0
>
>
> Using Apache Drill 1.4
> The DatabaseMetaData getters that obtain the major and minor versions of the 
> server or JDBC driver return 0 instead of 1.4.
> This prevents an application from dynamically adjusting how it interacts 
> based on which version of Drill a connection is accessing.
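
For context, an application would typically read these values through the 
standard java.sql.DatabaseMetaData getters, roughly as in the sketch below. 
The VersionProbe class and its "unknown" handling are illustrative 
assumptions, not part of Drill or of the reporter's application.

```java
import java.sql.DatabaseMetaData;
import java.sql.SQLException;

class VersionProbe {
    // Formats "major.minor", or flags the all-zero answer described in this issue.
    static String describe(int major, int minor) {
        return (major == 0 && minor == 0) ? "unknown (0.0)" : major + "." + minor;
    }

    // Server version as reported through JDBC metadata.
    static String serverVersion(DatabaseMetaData meta) throws SQLException {
        return describe(meta.getDatabaseMajorVersion(), meta.getDatabaseMinorVersion());
    }

    // Driver version; these two getters do not declare SQLException.
    static String driverVersion(DatabaseMetaData meta) {
        return describe(meta.getDriverMajorVersion(), meta.getDriverMinorVersion());
    }
}
```

A caller would pass connection.getMetaData() to either method. With the 
behavior reported here, both would come back as "unknown (0.0)", so the 
application has nothing to branch on.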





[GitHub] drill pull request #584: DRILL-4884: Fix bug that drill sometimes produced I...

2016-10-24 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/584#discussion_r84801937
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/IteratorValidatorBatchIterator.java
 ---
@@ -301,7 +301,7 @@ public IterOutcome next() {
   "Incoming batch [#%d, %s] has an empty schema. This is 
not allowed.",
   instNum, batchTypeName));
 }
-if (incoming.getRecordCount() > MAX_BATCH_SIZE) {
+if (incoming.getRecordCount() >= MAX_BATCH_SIZE) {
--- End diff --

I'm not sure if this is the right fix for this IOB problem. 

1. IteratorValidator is only inserted when assertions are enabled [1].  
Fixing this only in IteratorValidatorBatchIterator means the issue will 
still be there if assertions are disabled.
2. Even if we do turn on assertions, will the query hit an 
IllegalStateException instead of the IOB?

Can you try running the query with assertions off and on, and see if the 
query is successful in both cases?

[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ImplCreator.java#L72-L74
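
To illustrate point 1, here is a rough sketch of a check that only exists 
when assertions are on. It is not the ImplCreator or IteratorValidator code; 
the class and method names are hypothetical. The only real mechanism used is 
the common Java idiom of detecting -ea through a side effect inside an 
assert statement.

```java
class ValidatorGateSketch {
    static final int MAX_BATCH_SIZE = 65536;

    // The assignment inside assert runs only when the JVM was started with -ea.
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;
        return enabled;
    }

    // Mimics a size check that is present only when the validator was inserted.
    static void next(int recordCount, boolean validatorInserted) {
        if (validatorInserted && recordCount > MAX_BATCH_SIZE) {
            throw new IllegalStateException(
                "Incoming batch has " + recordCount + " records, above " + MAX_BATCH_SIZE);
        }
        // Without the validator, the oversized batch flows on to the next operator.
    }

    public static void main(String[] args) {
        boolean validatorInserted = assertionsEnabled();
        System.out.println("assertions enabled: " + validatorInserted);
        try {
            next(70_000, validatorInserted);
            System.out.println("oversized batch passed through unchecked");
        } catch (IllegalStateException e) {
            System.out.println("validator rejected batch: " + e.getMessage());
        }
    }
}
```

Running this with and without -ea shows the two outcomes: an 
IllegalStateException in one case, and no check at all in the other, which 
is why a fix placed only in the validator cannot cover production runs.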
 




isDateCorrect field in ParquetTableMetadata

2016-10-24 Thread Jinfeng Ni
Hello All,

DRILL-4203 addressed the date field issue.  In the fix, it introduced
a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
difficulty in understanding the meaning of this field.

According to [1], this field is set to false when Drill gets parquet
metadata from the parquet footer.  This field is set to true in the code flow
of [2] and [3], when Drill gets parquet metadata from the metadata cache.

Questions I have:
1.  If the parquet files are generated with Drill after DRILL-4203,
does Drill still think the date field is NOT correct (isDateCorrect = false)?
2.  Why does this field have nothing to do with the "autoCorrection" flag
[4]?  If someone turns off autoCorrection, will it have any impact on this
"isDateCorrect" flag?

Thanks in advance for any input,

Jinfeng


[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
[2] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
[3] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
[4] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355


[jira] [Created] (DRILL-4959) Drill 8.1 not able to connect to S3

2016-10-24 Thread Gopal Nagar (JIRA)
Gopal Nagar created DRILL-4959:
--

 Summary: Drill 8.1 not able to connect to S3
 Key: DRILL-4959
 URL: https://issues.apache.org/jira/browse/DRILL-4959
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Gopal Nagar


Hi Team,

I followed the references below to integrate Drill with AWS S3. The query 
keeps running for hours and doesn't display any output (I am querying a file 
with only 2 rows from S3).

Reference link
---
https://abhishek-tiwari.com/post/reflections-on-apache-drill
https://drill.apache.org/docs/s3-storage-plugin/ 

Query Format (Tried from UI & CLI)

select * from `s3`.`hive.csv` LIMIT 10;
select * from `s3`.`bucket_name/hive.csv` LIMIT 10;

After seeing the log below, I tried adding jets3t-0.9.3.jar to the jars 
directory, but it doesn't fix my problem.



Log Details
--
2016-10-24 17:00:02,461 [27f1c1ec-d82e-ba2a-2840-e7104320418f:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - Query text for query id 
27f1c1ec-d82e-ba2a-2840-e7104320418f: select * from `s3`.`hive.csv` LIMIT 10
2016-10-24 17:00:02,479 [drill-executor-39] ERROR 
o.a.d.exec.server.BootStrapContext - 
org.apache.drill.exec.work.foreman.Foreman.run() leaked an exception.
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
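
One way to narrow this down is to check whether the class named in the 
error is visible on a given classpath. The sketch below is a generic 
diagnostic, not something Drill ships; the ClasspathCheck class is 
hypothetical, and it only tells you about the classpath of the JVM you run 
it in, which may differ from the one the Drillbit uses.

```java
class ClasspathCheck {
    // Returns true if the named class can be located by this class's loader,
    // without running the class's static initializers.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class named in the log above.
        String name = "org.jets3t.service.S3ServiceException";
        System.out.println(name + " visible: " + isOnClasspath(name));
    }
}
```

If this reports false when run with the same jars the Drillbit loads, the 
jets3t jar is not being picked up from where it was placed.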







[jira] [Created] (DRILL-4958) Union All query fails stating that parquet files are schema less

2016-10-24 Thread Khurram Faraaz (JIRA)
Khurram Faraaz created DRILL-4958:
-

 Summary: Union All query fails stating that parquet files are 
schema less
 Key: DRILL-4958
 URL: https://issues.apache.org/jira/browse/DRILL-4958
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.9.0
Reporter: Khurram Faraaz


A UNION ALL query over parquet files fails and reports that the Union-All was 
over schema-less tables.
Parquet files are not without a schema; they do have metadata. We need to fix 
this. Postgres returns results for the same query on the same data.

Drill 1.9.0
git commit ID : a29f1e29

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `t_alltype.parquet` t1 union all 
select * from `t_alltype.parquet` t2;
Error: UNSUPPORTED_OPERATION ERROR: Union-All over schema-less tables must 
specify the columns explicitly
See Apache Drill JIRA: DRILL-2414


[Error Id: b6069bdc-8697-4578-a799-802ff3e80f00 on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}


