I believe that *every* convention will probably have its drawbacks. Using a
factory can help on the one hand, but it can also cause great confusion if
things get mixed up. It also makes things more complex. If we clearly document
the choice made, I can live with that.
My main point is that we should try to write and document the software in such
a way that MetaModel users will not get confused. I like the quotes idea, since
that will allow the user to explicitly express what is intended. But then,
let's extend it to something like this:
"schema_name"."table_name"."column_name"
Where schema_name and table_name can contain dots ("."). (I guess column
names cannot...)
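
To make the idea concrete, here is a minimal sketch of how such a quoted path
could be split into its parts. The class and method names (QuotedPathSplitter,
split) are made up for illustration and are not part of MetaModel; the sketch
also assumes that names never contain a quote character themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MetaModel: splits a path on dots,
// except for dots that appear inside double quotes.
class QuotedPathSplitter {

    static List<String> split(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : path.toCharArray()) {
            if (c == '"') {
                // Quote characters only delimit a name; they are not kept.
                inQuotes = !inQuotes;
            } else if (c == '.' && !inQuotes) {
                // An unquoted dot ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unbalanced quote in path: " + path);
        }
        tokens.add(current.toString());
        return tokens;
    }
}
```

With this, "documents.archive"."people.csv"."name" would split into the three
tokens documents.archive, people.csv and name, while an unquoted path is still
split on every dot.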
I hope you don't mind me rambling about this...
kind regards,
Hans
-----Original Message-----
From: Kasper Sørensen [mailto:[email protected]]
Sent: Wednesday, August 14, 2013 2:59 PM
To: [email protected]
Subject: Re: [DISCUSS] use folder name as schema name for file based
DataContexts
With those different preferences, we could even consider making something like
a "TableNameFactory" which converts filenames into table names. But I guess the
crucial point is which default convention to use.
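
Just to illustrate what I mean, a rough sketch of such a factory could look
like this. Nothing like it exists in the code base as far as this thread goes;
the interface name is the one suggested above, and the method name and the two
constants are assumptions made up for the example.

```java
// Hypothetical sketch of the "TableNameFactory" idea; not an existing MetaModel API.
interface TableNameFactory {
    // Converts a filename (without folder) into a table name.
    String toTableName(String filename);
}

class TableNameFactories {
    // Keeps the filename as-is, dots included: "people.csv" stays "people.csv".
    static final TableNameFactory PRESERVE_DOTS = filename -> filename;

    // Replaces every dot with an underscore: "people.csv" becomes "people_csv".
    static final TableNameFactory UNDERSCORE_DOTS = filename -> filename.replace('.', '_');
}
```

Note that the underscore variant is lossy: "FOO.PEOPLE.IN.FILE" and
"FOO_PEOPLE_IN_FILE" would both end up as the table name FOO_PEOPLE_IN_FILE.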
Underscoring makes it a bit cleaner to look at the column or table paths, but
it also makes the representation less direct. A user could start wondering if
there are other characters than dots that will be replaced by underscores etc.
It should be noted that MM's parser does support dots in both table and schema
names, so this is probably mostly a question of aesthetics.
The ambiguity that you point out is also interesting. So far I haven't seen it
appear in real life, but technically it could occur that you had two pairs of
schemas and tables that would generate an ambiguous table path. For instance:
Schema: foo.bar
Table: baz
and
Schema: foo
Table: bar.baz
The parser would currently favor the second schema ("foo") since it
incrementally tries for schema/table/column matches with every dot-separated
token. An improvement to the parser would be to allow quote characters, so that
you could then express your table path like this:
"foo.bar".baz
Also I want to note that some databases do support dots in schema/table/column
names, so this ambiguity can (although rarely) also occur in an RDBMS or other
data sources. It would also be quite common to have some separator (not
necessarily a dot) in NoSQL database column names, to indicate a nested field.
In HBase for instance they are referred to using a colon, like this:
"columnFamily:column".
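
To show why the "foo" schema wins in the foo / foo.bar example, here is a
simplified stand-in for the incremental matching. This is not MetaModel's
actual parser code; the class and method names are invented for the
illustration and it only covers the schema part of the path.

```java
import java.util.Set;

// Illustrative only: a simplified model of incremental, token-by-token
// schema matching. Not taken from MetaModel's parser.
class IncrementalSchemaMatcher {

    // Returns the first known schema name that is a dot-token prefix of the
    // path, trying the shortest prefix first; returns null if none matches.
    static String matchSchema(Set<String> schemaNames, String path) {
        String[] tokens = path.split("\\.");
        StringBuilder candidate = new StringBuilder();
        for (String token : tokens) {
            if (candidate.length() > 0) {
                candidate.append('.');
            }
            candidate.append(token);
            if (schemaNames.contains(candidate.toString())) {
                return candidate.toString();
            }
        }
        return null;
    }
}
```

Given the schemas "foo" and "foo.bar" and the path foo.bar.baz, the shortest
prefix "foo" matches first, so the pair foo.bar / baz can never be reached
without some way of quoting the schema name.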
All in all I am mostly inclined to preserve the dots from the filenames, but
I am also very curious what other people think!
2013/8/14 Hans Drexler <[email protected]>:
> Hi,
>
> First I agree with bumping this issue. At the customer site, it cost a lot
> of time to figure out what was going on. I am not sure if I like the
> extension as part of the table name, because:
> - I would never create a table in a relational database with a dot in
> the name
> - It creates an ambiguity. If you have a "full" path name to a column, like
> "documents.people.csv.name", then it is not clear whether the schema name is
> "documents.people" and the table name is "csv", or the schema name is
> "documents" and the table name is "people.csv". It seems natural to me that
> schema names contain dots, but not table names.
>
> Alternatives:
> - Leave the extension out of the name (probably not acceptable, because then
> you can no longer have two "tables" differing only in extension). Although I
> must say that personally I think this would be the best solution.
>
> - Use a conventional name, like:
> Schema name: Folder name
> Table name: The filename, including extension (all dots replaced by
> underscores).
> Resulting in e.g. a column path like this:
> documents.people_csv.name
>
> At the customer site, the file I needed to use actually had a name following
> this pattern: "bar/FOO.PEOPLE.IN.FILE". Using the convention, this would become:
> bar.FOO_PEOPLE_IN_FILE
>
> IMHO this is preferable to "bar.foo.people.in.file"
>
> The problem is of course that it would now be impossible to have
> another file "bar/FOO_PEOPLE_IN_FILE" :-(
>
> I am happy to hear other people's thoughts.
>
>
> Hans
>
>
> -----Original Message-----
> From: Kasper Sørensen [mailto:[email protected]]
> Sent: Wednesday, August 14, 2013 10:18 AM
> To: [email protected]
> Subject: Re: [DISCUSS] use folder name as schema name for file based
> DataContexts
>
> Rats, made a mistake in that diff. The Gist has been updated [1] and now
> contains the ResourceUtils class which was missing before.
> [1] https://gist.github.com/kaspersorensen/6210970
>
> 2013/8/12 Kasper Sørensen <[email protected]>:
>> Here's a proposed patch (implemented for CSV and fixedwidth files
>> which are the modules that implemented the old schema naming pattern):
>> https://gist.github.com/kaspersorensen/6210970
>>
>> 2013/8/10 Kasper Sørensen <[email protected]>:
>>> https://issues.apache.org/jira/browse/METAMODEL-4
>>>
>>> 2013/8/10 Henry Saputra <[email protected]>:
>>>> What is the JIRA for this one?
>>>>
>>>>
>>>> On Fri, Aug 9, 2013 at 2:26 AM, Manuel van den Berg <
>>>> [email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> (shouldn't I just vote on the Jira for this?)
>>>>>
>>>>> manuel
>>>>>
>>>>> > -----Original Message-----
>>>>> > From: Kasper Sørensen [mailto:[email protected]]
>>>>> > Sent: Friday, August 09, 2013 9:03
>>>>> > To: [email protected]
>>>>> > Subject: Re: [DISCUSS] use folder name as schema name for file
>>>>> > based DataContexts
>>>>> >
>>>>> > Allow me to bump this issue (it's my impression that more people
>>>>> > have joined in a bit late, after this topic was posted).
>>>>> >
>>>>> > I think this is one of the more important issues that I would
>>>>> > want to fix before we make our first release at Apache.
>>>>> >
>>>>> > 2013/7/24 Kasper Sørensen <[email protected]>:
>>>>> > > Right now we have this slightly odd naming convention for
>>>>> > > schema and table names when building metadata for e.g. a CSV
>>>>> > > file or a fixed width value file.
>>>>> > >
>>>>> > > Schema name: The filename, including file extension.
>>>>> > > Table name: The filename without extension.
>>>>> > > Resulting in e.g. a column path like this:
>>>>> > > people.csv.people.name
>>>>> > >
>>>>> > > I suggest we change it to this convention:
>>>>> > >
>>>>> > > Schema name: Folder name
>>>>> > > Table name: The filename, including file extension.
>>>>> > > Resulting in e.g. a column path like this:
>>>>> > > documents.people.csv.name
>>>>> > >
>>>>> > > Why do I think this would be an improvement?
>>>>> > >
>>>>> > > 1) Because this would first of all make a kind of sense to the
>>>>> > > user to see the file system's hierarchy reflected in the schema model.
>>>>> > > 2) Because it allows us to make these DataContexts operate
>>>>> > > not on a single file, but on a directory of files. I have seen
>>>>> > > quite a number of times by now that users of MetaModel, or users
>>>>> > > of e.g. DataCleaner, which uses MetaModel quite heavily, want to
>>>>> > > do this sort of stuff.
>>>>> > > 3) The removing of the file extension stuff is kind of broken
>>>>> > > and a strange convention in the first place.
>>>>> > >
>>>>> > > While this doesn't really break backwards compatibility in
>>>>> > > terms of Java code, it would break configuration files and
>>>>> > > other stuff of applications that use MetaModel. But I do
>>>>> > > believe that can be communicated and handled through carefully
>>>>> > > explaining the new convention on the migration page (that I recently
>>>>> > > started writing [1]).
>>>>> > >
>>>>> > > What do you think?
>>>>> > >
>>>>> > > [1]
>>>>> > > http://wiki.apache.org/metamodel/MigratingFromEobjectsMetaModel
>>>>>