[ 
https://issues.apache.org/jira/browse/DRILL-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996511#comment-14996511
 ] 

Bhallamudi Venkata Siva Kamesh commented on DRILL-3524:
-------------------------------------------------------

Hari,
 I agree with you and it may be handy. But here is the problem. 
 
When you start providing a probable schema by scanning the first few documents, 
all of your queries will be validated against that schema, and this may cause 
some queries to fail (which we should not allow, as Mongo is for schemaless 
documents). It may also sometimes lead to incompatible types. 
However, when we say it is a dynamic schema, the schema is constructed at the 
query's runtime.
 
 
As an example, suppose that by scanning the first few documents we discovered 
that a collection has the following schema:

{code}
{ f1 : int, f2: float, f3: String}
{code}

What would we do when we have the following query?

{code}
select f4, f5 from collection
{code}

Assume that the f4 and f5 fields exist only in later documents. The query would 
then fail, because neither field is in the sampled schema.
So I feel the schema should be driven by the application, if we want to enforce 
one for a given collection, and not derived from the data, at least for 
Mongo.
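
To make the concern concrete, here is a minimal Python sketch that uses plain 
dicts in place of Mongo documents. The helpers infer_schema and unknown_columns 
are hypothetical names for illustration only; they are not Drill or MongoDB 
APIs, and real validation in Drill may behave differently.

```python
from typing import Any, Dict, Iterable, List


def infer_schema(docs: Iterable[Dict[str, Any]], sample_size: int) -> Dict[str, str]:
    """Merge field names and Python type names from the first sample_size docs.

    Hypothetical helper: the first type seen for a field wins.
    """
    schema: Dict[str, str] = {}
    for i, doc in enumerate(docs):
        if i >= sample_size:
            break
        for field, value in doc.items():
            schema.setdefault(field, type(value).__name__)
    return schema


def unknown_columns(schema: Dict[str, str], columns: List[str]) -> List[str]:
    """Return the requested columns that the schema does not know about."""
    return [c for c in columns if c not in schema]


# Stand-in for a collection: f4 and f5 only appear in a later document.
collection = [
    {"f1": 1, "f2": 1.5, "f3": "a"},
    {"f1": 2, "f2": 2.5, "f3": "b"},
    {"f1": 3, "f4": "x", "f5": "y"},
]

# Schema sampled from the first two documents only.
sampled = infer_schema(collection, sample_size=2)

# A query such as "select f4, f5 from collection" validated against the
# sampled schema would be rejected, although the fields do exist.
rejected = unknown_columns(sampled, ["f4", "f5"])

# Scanning every document finds the fields, at the cost of a full scan.
full = infer_schema(collection, sample_size=len(collection))
```

With the sampled schema both f4 and f5 are reported as unknown, while the full 
scan finds them. That is the trade-off between a cheap sample and a complete 
but expensive scan.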

Any thoughts?

> Drill proper DESCRIBE support for MongoDB
> -----------------------------------------
>
>                 Key: DRILL-3524
>                 URL: https://issues.apache.org/jira/browse/DRILL-3524
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata, Storage - MongoDB
>    Affects Versions: 1.1.0
>            Reporter: Hari Sekhon
>             Fix For: Future
>
>
> Request to add full DESCRIBE support for MongoDB collections.
> I understand this may be difficult / sub-optimal due to the flexible schema 
> nature of Mongo docs but if you can tabulate results when reading directly 
> from MongoDB for which you have read the field names, then it's also possible 
> to extract all field names to present for the describe command, albeit an 
> inefficient scan to do so.
> Currently describe returns a pseudo / inaccurate / unhelpful metadata:
> {code}+--------------+------------+--------------+
> | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> +--------------+------------+--------------+
> | *            | ANY        | YES          |
> +--------------+------------+--------------+{code}
> Perhaps you could extend DESCRIBE to scan the first few dozen docs by default 
> to create a merged schema as well as adding an optional argument to the 
> describe command to allow for scanning a user-specified number of docs from 
> which to describe the schema, or an ALL argument keyword to describe to scan 
> all docs in a collection to get the complete global schema for the collection?
> In case of schema evolution it might be an interesting option to additionally 
> read the newest and oldest records, maybe the first and last records by ID 
> etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
