Re: freemarker-generator: Improving the input documents concept

Siegfried Goeschl Thu, 05 Mar 2020 12:36:23 -0800

Hi Daniel,

The introduction of a named `Datasource` allows us to simplify / streamline a few things


* I have a meaningful user-supplied name

* I can pass additional configuration information, as already implemented with `charset` and `contenttype`, and this would also allow configuring a `CSV Datasource`, e.g. `users=./data/users.csv#format=default&header=true&delimeter=TAB`, which can be readily parsed

* Currently the names of datasources are taken from their relative file name - might make sense to drop that but I need to contemplate :-)
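
A minimal sketch of how such a spec could be split into a name, a location and options. This is an illustration in Python, not the freemarker-generator parser; the function name `parse_datasource_spec` is hypothetical, and the option keys are copied verbatim from the example above.

```python
from urllib.parse import parse_qsl


def parse_datasource_spec(spec):
    """Split 'name=location#key=value&...' into (name, location, options).

    Hypothetical helper: the name is optional, and everything after '#'
    is treated as URL-style key/value options.
    """
    head, _, fragment = spec.partition("#")
    name, sep, location = head.partition("=")
    if not sep:
        # No '=' before the '#': the spec carries no user-supplied name.
        name, location = None, head
    options = dict(parse_qsl(fragment))
    return name, location, options


name, location, options = parse_datasource_spec(
    "users=./data/users.csv#format=default&header=true&delimeter=TAB"
)
print(name, location, options)
```

Splitting at `#` first keeps the `=` signs inside the options from being mistaken for the name separator.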

Regarding the "global mode" and "output generators files" - I'm sorry, but I'm not getting it

* I refined the https://gist.github.com/sgoeschl/b09b343a761b31a6c790d882167ff449 to make my points more clearly

* Do you think of defining an explicit "output generator file" containing `datasources`, `templates` and `outputs` - yes, that could be done but it does not feel like an interactive command line tool any longer



Regarding "more idiomatic FTL usage"

* Yes, I need to dive into custom template models or whatever it is called :-)



Something we need to iron out is a release policy

* Currently we have little agreement on how the CLI should look or behave

* I think you are leaning towards a 1.0 release while I favour 0.x.y to have room to make mistakes / experiments

* I personally see the possibility that we don't get a release out - "perfect is the enemy of good"

How would you like to handle the problem - can we agree on a minimal feature set worthy of a release?


Thanks in advance,

Siegfried Goeschl


On 1 Mar 2020, at 11:33, Daniel Dekany wrote:

Actually not recommended but we have named data sources for less than 24
hours
Sorry, not sure what that means. Anyway, my "vote" is let's not give
automatic names if that's not recommended to utilize. I mean, in case we
happen to agree on that, why leave it there. Especially if automatically
chosen names can clash with explicitly given ones, that would be
trouble. (I'm not sure right now if they can... the path we use as the
name can be relative? Then it realistically can.)
This is a command line tool where we have little idea what the user will do
or abuse
No matter how much/little we know, we firmly put our bets by releasing
something. So if some feature is certainly not right, that's enough to not
have it, I think.

How does a "data loader" know that it is responsible to load a file

What should a "CSV data loader" do - parse it into a list of
records or stream one by one?
I think I was misunderstood here. It's not about some kind of auto-magic.
It's about where do you specify what to load and how, and in what format do
you specify that. Of course, you must specify the data source (basically an
URI for now as I saw), the rough format (CSV), and the format options
(separator character, etc.), and other freemarker-generator loading options
(like which CSV columns are numbers, which are dates, with what format,
what counts as null, etc.).
What was confusing in what I said much earlier is probably that you don't
need a global "--mode". That just means that you can have multiple "modes"
in the same run, not that you need some big auto-magic. And that they
aren't really "modes" then... I think it's just natural that you can have
different kinds of "output generator" files in the same run. Why force the
assumption that you don't, especially considering that they might want
to access common data (which you don't want to load again and again, for
each run of the different --mode-s you need). Of course, as you might
select files with wildcards (or by specifying a whole directory, or with
some Maven matcher), you just can't directly associate the data loader
options to the individual data sources. Instead you can say elsewhere that
*.csv inside this explicit "group", or with this file name pattern, is to
be loaded like this. That's what you might have perceived as auto-magic. It's
just mass-producing data loaders for "cattle" files.
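
One way to picture "loader options declared elsewhere, matched by file name pattern" is the sketch below. It is an assumption-laden illustration in Python: the rule table, the option keys and `loader_options_for` are all hypothetical and not part of freemarker-generator.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (file name pattern, loader options). First match wins.
LOADER_RULES = [
    ("*.csv", {"format": "csv", "delimiter": ",", "header": True}),
    ("*.log", {"format": "text"}),
]


def loader_options_for(path, rules=LOADER_RULES):
    """Return the options of the first rule matching the file name, else None."""
    file_name = path.rsplit("/", 1)[-1]
    for pattern, options in rules:
        if fnmatchcase(file_name, pattern):
            return options
    return None


for path in ["data/users.csv", "logs/access.log", "notes.txt"]:
    print(path, "->", loader_options_for(path))
```

The files themselves stay anonymous ("cattle"); only the pattern rule carries the knowledge of how to load them.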
How to handle the case if you have multiple potential data loaders for a
single file?
As per above, that's just two data loaders referring to the same data
source, so, nothing special.
As of the current state of things, this is how I'm supposed to load a CSV,
in the template itself (if I'm not outdated/mistaken):

<#assign cvsFormat = CSVTool.formats.DEFAULT.withHeader()>
<#assign foos = CSVTool.parse(Datasources.get("foos"), cvsFormat).records>
<#assign bars = CSVTool.parse(Datasources.get("barb"), cvsFormat).records>
It will be worth exploring how to make these look more "idiomatic" FTL
(given this is an "official" FM product now, I think, we should show how
it's done), and nicer in general. Point for now is, that's basically two
data-loaders interwoven with the template there. Because they are
interwoven like that, you can't reuse what they loaded for another
template execution.
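
To illustrate the reuse point: if loading lives outside the template, the result can be cached and shared between template executions. The following Python sketch is only a model of that idea; `DataLoaderCache` and `load_foos` are hypothetical names, and the CSV content is inline sample data.

```python
import csv
import io


class DataLoaderCache:
    """Runs each named loader at most once and hands out the cached result."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-argument callable
        self._loaded = {}
        self.load_count = 0

    def get(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
            self.load_count += 1
        return self._loaded[name]


def load_foos():
    # Stand-in for reading and parsing a real CSV file.
    text = "id,name\n1,alpha\n2,beta\n"
    return list(csv.DictReader(io.StringIO(text)))


cache = DataLoaderCache({"foos": load_foos})
first_run = cache.get("foos")   # first template execution triggers the load
second_run = cache.get("foos")  # second execution reuses the parsed records
print(cache.load_count)
```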
That comes down to personal preferences, e.g. chown uses "owner[:group]"
Yeah, but XML namespaces, Java, C, etc. all use
<parent><operator><child>, so, I think, that clicks for more of our
potential users. So let's bet on what clicks for more users.
Besides, I challenged the very idea that we need both groups and names.
:) Saying that it's simpler and less opinionated (more flexible) to have
just multiple names (like tags). What's the end of that?

On Sun, Mar 1, 2020 at 9:47 AM Siegfried Goeschl <
[email protected]> wrote:
HI Daniel,

Please see my comments below

Thanks in advance,

Siegfried Goeschl
On 29.02.2020, at 21:02, Daniel Dekany <[email protected]> wrote:
I try to provide a useful name even when the content is coming from an
URL
When is it recommended to rely on that though? Because utilizing that
means that renaming a data source file can break the process, even if
you call freemarker-cli with the up to date file name. And if that
happens depends on what you (or an other random colleague!) have dug
inside the templates.
So I guess we better just don't support this. Less code and less things
to document too.
Actually not recommended but we have named data sources for less than 24
hours
I think we have a different understanding what a "Document" /
"Datasource / DataSource" should do
Thing is, eventually (most certainly pre-1.0, as it influences
architecture), certain needs will have to be addressed, somehow. Then we
will see what "things" we really need. For now I thought we need
"things" that are much more than paths, and encapsulate the "how to load
the data" aspect. I called them data sources, but maybe we should call
them "data loaders" to free up data sources for the more primitive
thing. Some needs/doubts to address, *later*: Is it really the best
approach for users to load/parse data sources programmatically (that
code is written in FTL, inside the templates)? Also, is the template the
right place for doing that, because, when multiple templates (or just
multiple template *runs* of the same template, each generating a
different output file) need common data, they shouldn't load it again
and again. Also, different topic, can we handle the case "transparently"
enough when the data is not coming from a file?
This is a command line tool where we have little idea what the user will
do or abuse
* How does a "data loader" know that it is responsible to load a file
* What should a "CSV data loader" do - parse it into a list of
records or stream one by one?
* How to handle the case if you have multiple potential data loaders for a
single file?
I'm leaning towards building blocks where the user controls the work to be
done even if it requires one to two extra lines of FTL code
The joy of programming - I did not intend to use "name:group" together
with wildcards :-)
For a CLI tool, I guess we agree that it should work. So maybe, like this
(here logs and foos are meant to be "groups"):
--data-source logs file1.log file2.log fileN.log --data-source foos
foo1.csv foo2.csv fooN.csv  --data-source bar bar.xlsx
It so happens that here you don't really have good control over the
number of files associated to the name, so, maybe yet another reason to
not differentiate names and groups.
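
A rough sketch of how the proposed `--data-source NAME FILE...` form could be collected into a name-to-files mapping after the shell has expanded any wildcards. This is plain Python for illustration, not Picocli and not the actual CLI; `group_data_sources` is a hypothetical helper.

```python
def group_data_sources(argv):
    """Collect '--data-source NAME FILE...' runs into {NAME: [FILE, ...]}."""
    groups = {}
    current = None
    expect_name = False
    for arg in argv:
        if arg == "--data-source":
            expect_name = True
        elif expect_name:
            # The first token after the option is the name; reuse the list
            # if the same name is given more than once.
            current = groups.setdefault(arg, [])
            expect_name = False
        elif current is not None:
            current.append(arg)
        else:
            raise ValueError(f"file given before any --data-source: {arg}")
    return groups


args = ("--data-source logs file1.log file2.log fileN.log "
        "--data-source foos foo1.csv foo2.csv fooN.csv "
        "--data-source bar bar.xlsx").split()
print(group_data_sources(args))
```

Because the name is a separate argument, the shell is free to expand `*.log` into any number of files, which is exactly why the number of files per name is not under the tool's control.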
I disagree here - I think using a name would be used more often. I added
the "group" as an afterthought since some grouping could be useful
We do agree on that. What I said is that the *syntax* should be so that
the group comes first. It's still optional. Like this:
--data-source group:name /somewhere
--data-source name /somewhere
That comes down to personal preferences, e.g. chown uses "owner[:group]"
On Sat, Feb 29, 2020 at 7:34 PM Siegfried Goeschl <
[email protected]> wrote:
HI Daniel,

See my comments below

Thanks in advance,

Siegfried Goeschl
On 29.02.2020, at 19:08, Daniel Dekany <[email protected]>
wrote:
FREEMARKER-135 freemarker-generator-cli: Support user-supplied names
for datasources
So, I can do this to have both a name and a group associated to a data
source:
--datasource someName:someGroup=somewhere/something
Correct
Or if I only want a name, but not a group (or an "" group actually -
bug?), then:
--datasource someName=somewhere/something
Correct
Or if only a group but not a name (or a "" name actually) then:
--datasource :someGroup=somewhere/something
Mhmm, that would be unintended functionality from my side - current
approach is that every "Document" / "Datasource / DataSource" is named
A name must identify exactly 1 data source, while a group identifies a
list of data sources.
No, every "Document" / "Datasource / DataSource" has a name currently
but uniqueness is not enforced. Only if you want to get a "Document" /
"Datasource / DataSource" by its exact name do I check for exactly one
search hit and throw an exception otherwise. I try to provide a useful
name even when the content is coming from a URL or STDIN (and I will
probably add environment variables as "Document" / "Datasource /
DataSource", e.g. configuration in the cloud as JSON content passed as
an environment variable)
Is this idea, that a data source can be part of a group, and then is
also possibly identifiable with a name, coming from a use case? I mean,
it's possibly important somewhere, but if so, then it's strange that you
can put something into only a single group. If we need this kind of
thing, then perhaps you should be just allowed to associate the data
source with a list of names (kind of like tagging), and then when the
template wants to get something by name, it will tell there if it
expects exactly one or a list of data sources. Then you don't need to
introduce two terms in the documentation either (names and groups).
Again, if we want this at all, instead of just going with a data source
that itself gives a list. (And if not, how will we handle a data source
that loads from a non-file source?)
I actually thought of implementing tagging but considered a "group"
sufficient.
* If you don't define anything everything goes into the "default" group
* For individual documents you can define a name and an optional group
I think we have a different understanding what a "Document" /
"Datasource
/ DataSource" should do

* It is a dumb
* It is lazy since data is only loaded on demand
* There is no automagic like "oh, this is a JSON file, so let's go to
the JSON tool and create a map readily accessible in the data model"
Note that the current command line syntax doesn't work well with shell
wildcard expansion. Like this:
--datasource :someGroup=logs/*.log
will try to expand ":someGroup=logs/*.log", and because it finds nothing
(and because the rules of sh and the like are a mess), you will get the
parameter value as is, without * expanded.
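
One conceivable workaround, shown purely as a sketch, is for the tool to split off the `name:group=` prefix and expand the wildcard itself. The Python below only models that idea; `expand_datasource_arg` is a hypothetical helper and nothing here reflects how freemarker-generator actually behaves.

```python
import glob


def expand_datasource_arg(arg):
    """Split 'prefix=pattern' and expand the pattern part ourselves.

    Returns (prefix, files). If nothing matches, the pattern is passed
    through unchanged, mirroring what sh does with an unmatched wildcard.
    """
    prefix, sep, pattern = arg.partition("=")
    if not sep:
        # No '=' at all: the whole argument is the pattern.
        prefix, pattern = "", arg
    matches = sorted(glob.glob(pattern))
    return prefix, matches or [pattern]


# The shell leaves ':someGroup=logs/*.log' untouched, so the tool sees it as is.
print(expand_datasource_arg(":someGroup=logs/*.log"))
```

With the name as a separate argument, as in the `--data-source NAME FILE...` form discussed elsewhere in this thread, the shell can do the expansion and no such workaround is needed.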
The joy of programming - I did not intend to use "name:group" together
with wildcards :-)
Also, I think the syntax with colon should be flipped, because in
other places foo:bar usually means that foo is the bigger unit (the
container), and bar is the smaller unit (the child).
I disagree here - I think using a name would be used more often. I added
the "group" as an afterthought since some grouping could be useful
On Sat, Feb 29, 2020 at 5:03 PM Siegfried Goeschl <
[email protected]> wrote:
Hi Daniel,

I'm an enterprise developer - bad habits die hard :-)

So I closed the following tickets and merged the branches
1) FREEMARKER-129 freemarker-generator: Merge "freemarker-cli" into
"freemarker-generator"
2) FREEMARKER-134 freemarker-generator: Rename "Document" to
"Datasource"
3) FREEMARKER-135 freemarker-generator-cli: Support user-supplied
names for datasources

Thanks in advance,

Siegfried Goeschl
On 29.02.2020, at 12:19, Daniel Dekany <[email protected]>
wrote:
Yeah, and of course, you can merge that branch. You can even work on
the master directly after all.

On Sat, Feb 29, 2020 at 12:17 PM Daniel Dekany <
[email protected]>
wrote:
But, I do recognize the cattle use case (several "faceless" files with
common format/schema). Only, my idea is to push that complexity onto the
data source. The "data source" concept shields the rest of the
application from the details of how the data is stored or retrieved. So,
a data source might load a bunch of log files from a directory, and
present them as a single big table, or like a list of tables, etc. So I
want to deal with the cattle use case, but the question is what part of
the architecture will deal with this complication, in other words, how
do you box things. Why my initial bet is to stuff that complication into
the "data source" implementation(s) is that data sources are inherently
varied. Some return a table-like thing, some have multiple named tables
(worksheets in Excel), some return a tree of nodes (XML), etc. So then,
some might return a list-of-list-of log records, or just a single list
of log-records (put together from daily log files). That way cattle
don't add to conceptual complexity. Now, you might be aware of cases
where the cattle concept must be more exposed than this, and then we
can't box things like this. But this is what I tried to express.
Regarding "output generators", and how that applies on the command line.
I think it's important that the common core between Maven and
command-line is as fat as possible. Ideally, they are just two syntaxes
to set up the same thing. Mostly at least. So, if you specify a template
file to the CLI application, in a way so that it causes it to process
that template to generate a single output, then there you have just
defined an "output generator" (even if it wasn't explicitly called like
that in the command line). If you specify 3 csv files to the CLI
application, in a way so that it causes it to generate 3 output files,
then you have just defined 3 "output generators" there (there's at least
one template specified there too, but that wasn't an "output generator"
itself, it was just an attribute of the 3 output generators). If you
specify 1 template, and 3 csv files, in a way so that it will yield 4
output files (1 for the template, 3 for the csv-s), then you have
defined 4 output generators there. If you have a data source that loads
a list of 3 entities (say, 3 csv files, so it's a list of tables then),
and you have 2 templates, and you tell the CLI to execute each template
for each item in said data source, then you have just defined 6 "output
generators".
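
The counting in the paragraph above can be reproduced with a small model in which an output generator is just a (template, data item) pair. This is a Python illustration under that simplifying assumption; the function and file names are made up.

```python
from itertools import product


def generators_one_per_input(templates, data_files, transform_template):
    """Each stand-alone template and each data file yields one generator."""
    generators = [(t, None) for t in templates]
    generators += [(transform_template, d) for d in data_files]
    return generators


def generators_per_item(templates, items):
    """Each template is executed once for each item of a data source."""
    return list(product(templates, items))


csv_files = ["a.csv", "b.csv", "c.csv"]
four = generators_one_per_input(["report.ftl"], csv_files, "transform.ftl")
six = generators_per_item(["t1.ftl", "t2.ftl"], csv_files)
print(len(four), len(six))  # 4 6
```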

On Fri, Feb 28, 2020 at 11:08 AM Siegfried Goeschl <
[email protected]> wrote:
Hi Daniel,

That all depends on your mental model and work you do,
expectations,
experience :-)


__Document Handling__

*"But I think actually we have no good use case for list of
documents that's passed at once to a single template run, so, we can
just ignore that complication"*
In my case that's not a complication but my daily business - I'm
regularly wading through access logs - yesterday probably a couple of
hundred access logs across two staging sites to help tracking some
strange API gateway issues :-)

My gut feeling is (borrowing from
https://medium.com/@Joachim8675309/devops-concepts-pets-vs-cattle-2380b5aab313
)

1. You have a few lovely named documents / templates - `pets`
2. You have tons of anonymous documents / templates to process -
`cattle`
3. The "grey area" comes into play when mixing `pets & cattle`
`freemarker-cli` was built with 2) in mind and I want to cover 1)
since it is equally important and common.


__Template And Document Processing Modes__
IMHO it is important to answer the following question: "How many
outputs do you get when rendering 2 templates and 3 datasources? Two,
Three or Six?"

Your answer is influenced by your mental model / experience

* When wading through tons of CSV files, access logs, etc. the answer is
"2"
* When doing source code generation the obvious answer is "6"
* Can't imagine a use case which results in "3" but I'm pretty sure we
will encounter one

__Template and document mode probably shouldn't exist__
That's hard for me to fully understand - I definitely lack your insights
& experience writing such tools :-)
Defining the `Output Generator` is the underlying model for the Maven
plugin (and probably FMPP).
I'm not sure if this applies to command lines, at least not in the way I
use them (or would like to use them)


Thanks in advance,

Siegfried Goeschl

PS: Can/shall I merge the PR to bring in `freemarker-cli`?


On 28 Feb 2020, at 9:14, Daniel Dekany wrote:
Yeah, "data source" is surely a too popular name, but for a reason.
Anyone has other ideas?

As of naming data sources and such. One thing I was wondering about back
then is how to deal with a list of documents given to a template, versus
exactly 1 document given to a template. But I think actually we have no
good use case for a list of documents that's passed at once to a single
template run, so, we can just ignore that complication. A document has a
name, and that's always just a single document, not a collection, as far
as the template is concerned. (We can have multiple documents per run,
but those normally yield separate output generators, so it's still only
one document per template.) However, we can have data source types
(document types with old terminology) that collect together multiple
data files. So then that complexity is encapsulated into the data source
type, and doesn't complicate the overall architecture. That's another
case when a data source is not just a file. Like maybe there's a data
source type that loads all the CSV-s from a directory, into a single big
table (I had such a case), or even into a list of tables. Or, as I
mentioned already, a data source is maybe an SQL query on a JDBC data
source (and we got the first term clash... JDBC also calls them data
sources).

Template and document mode probably shouldn't exist from the user
perspective either, at least not as a global option that must apply to
everything in a run. They could just give the files that define the
"output generators", and some of them will be templates, some of them
are data files, in which case a template needs to be associated with
them (and there can be a couple of ways of doing that). And then again,
there are the cases where you want to create one output generator per
entity from some data source.
On Fri, Feb 28, 2020 at 8:23 AM Siegfried Goeschl <
[email protected]> wrote:
Hi Daniel,
See my comments below - and thanks for your patience and input :-)
*Renaming Document To DataSource*

Yes, makes sense. I tried to avoid it since I'm using javax.activation
and its DataSource.

*Template And Document Mode*
Agreed - I think it is a valuable abstraction for the user but it is
not an implementation concept :-)

*Document Without Symbolic Names*
Also agreed and it is going to change but I have not made up my mind
yet what exactly to implement.

Thanks in advance,

Siegfried Goeschl

On 28 Feb 2020, at 1:05, Daniel Dekany wrote:

A few quick thoughts on that:

- We should replace the "document" term with something more telling. It
doesn't tell that it's some kind of input. Also, most of these inputs
aren't something that people typically call documents. Like a csv file,
or a database table, which is not even a file (OK we don't support such
a thing at the moment). I think, maybe "data source" is a safe enough
term. (It also rhymes with data model.)
- You have separate "template" and "document" "mode", that applies to a
whole run. I think such specialization won't be helpful. We could just
say, on the conceptual level at least, that we need a set of "output
generators". An output generator is an object (in the API) that
specifies a template, a data-model (where the data-model is possibly
populated with "documents"), and an output "sink" (a file path, or
stdout), and can generate the output itself. A practical way of defining
the output generators in a CLI application is via a bunch of files, each
defining an output generator. Some of those files are maybe a template
(that you can even detect from the file extension), or a data file that
we currently call a "document". They could freely mix inside the same
run. I have also met a use case where you have a single table (single
"document"), and each record in it yields an output file. That can also
be described in some file format, or really in any other way, like
directly in a command line argument, via API, etc.
- You have multiple documents without an associated symbolic name in
some examples. Templates can't identify those then in a well
maintainable way. The actual file name is often not a good identifier,
can change over time, and you might not even have good control over it,
like you already receive it as a parameter from somewhere else, or
someone moves/renames the files that you need to read. Index is also not
very good, but I have written about that earlier.
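
The "output generator" object described in the second bullet (a template, a data-model and an output sink, able to generate its own output) could be modelled roughly as follows. This is a Python sketch only: `string.Template` stands in for a FreeMarker template, and the class and field names are assumptions, not an existing API.

```python
import io
from dataclasses import dataclass
from string import Template
from typing import TextIO


@dataclass
class OutputGenerator:
    """Bundles everything needed to produce one output."""

    template: str     # template source (stand-in for an FTL template)
    data_model: dict  # variables visible to the template
    sink: TextIO      # where the rendered output goes (file, stdout, ...)

    def generate(self):
        self.sink.write(Template(self.template).substitute(self.data_model))


sink = io.StringIO()
OutputGenerator("Hello $name!", {"name": "world"}, sink).generate()
print(sink.getvalue())  # Hello world!
```

A run is then simply a list of such objects, regardless of whether each one was defined by a template file, a data file, or a record of a table.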


On Wed, Feb 26, 2020 at 9:33 AM Siegfried Goeschl <
[email protected]> wrote:

Hi folks,
still wrapping my head around it, but I assembled some thoughts here -
https://gist.github.com/sgoeschl/b09b343a761b31a6c790d882167ff449
Thanks in advance,

Siegfried Goeschl


On 23 Feb 2020, at 23:14, Daniel Dekany <[email protected]>
wrote:
What you are describing is more like the angle that FMPP took initially,
where templates drive things, they generate the output for themselves
(even multiple output files if they wish). By default the output file's
name (and relative path) is deduced from the template name. There was
also a global data-model, built in a configuration file (or equally,
built via command line arguments, or both mixed), from which templates
get whatever data they are interested in. Take a look at the figures
here: http://fmpp.sourceforge.net/qtour.html. Later, this concept was
generalized a bit more, because you could add XML files at the same
place where you have the templates, and then you could associate
transform templates to the XML files (based on path pattern and/or the
XML document element). Now that's like what freemarker-generator had
initially (data files drive output, and the template is there to
transform it).

So I think the generic mental model would look like this:

1. You got files that drive the process, let's call them *generator
files* for now. Usually, each generator file yields an output file (but
maybe even multiple output files, as you might have seen in the last
figure). These generator files can be of many types, like XML, JSON,
XLSX (as in the original freemarker-generator), and even templates (as
is the norm in FMPP). If the file is not a template, then you got a set
of transformer templates (-t CLI option) in a separate directory, which
can be associated with the generator files based on name patterns, and
even based on content (schema usually). If the generator file is a
template (so that's a positional @Parameter CLI argument that happens to
be an *.ftl, and is not a template file specified after the "-t"
option), then you just Template.process(...) it, and it prints what the
output will be.
2. You also have a set of variables, the global data-model, that
contains commonly useful stuff, like what you now call parameters (CLI
-Pname=value), but also maybe data loaded from JSON, XML, etc. Those
data files aren't "generator files". Templates just use them if they
need them. An important thing here is to reuse the same mechanism to
read and parse those data files, which was used in templates when
transforming generator files. So we need a common format for specifying
how to load data files. That's maybe just FTL that #assigns to the
variables, or maybe a more declarative format.

What I have described in the original post here was a less generic form
of this, as I tried to be true to the original approach. I thought the
proposal will be drastic enough as it is... :) There, the "main"
document is the "generator file" from point 1, the "-t" template is the
transform template for the "main" document, and the other named
documents ("users", "groups") are a poor man's shared data-model from
point 2 (together with -PName=value).
There's a further somewhat confusing thing to get right with the
list-of-documents (`DocumentList`, `NamedDocumentLists`) thing though.
In the model above, as per point 1, if you list multiple data files,
each will generate a separate output file. So, if you need to take in a
list of files to transform it to a single output file (or at least with
a single transform template execution), then you have to be explicit
about that, as that's not the default behavior anymore. But it's still
absolutely possible. Imagine it as a "list of XLSX-es" is itself like a
file format. You need some CLI (and Maven config, etc.) syntax to
express that, but that shouldn't be a big deal.



On Sun, Feb 23, 2020 at 9:43 PM Siegfried Goeschl <
[email protected]> wrote:

Hi Daniel,
Good timing - I was looking at a similar problem from a different angle
yesterday (see below)
Don't have enough time to answer your email in detail now - will do that
tomorrow evening

Thanks in advance,

Siegfried Goeschl


=== START
# FreeMarker CLI Improvement
## Support Of Multiple Template Files
Currently we support the following combinations

* Single template and no data files
* Single template and one or more data files

But we can not support the following use case which is quite typical in
the cloud

__Convert multiple templates with a single data file, e.g. copying a
directory of configuration files using a JSON configuration file__
## Implementation notes
* When we copy a directory we can remove the `ftl` extension on the fly
* We might need an `exclude` filter for the copy operation
* Initially resolve to a list of template files and process one after
another
* Need to calculate the output file location and extension
* We need to rename the existing command line parameters (see below)
* Do we need multiple include and exclude filters?
* Do we need file versus directory filters?
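
The first few notes (include/exclude filtering, dropping the `ftl` extension, computing the output location) might look like this in outline. It is a hedged Python sketch, not the freemarker-cli implementation; `plan_outputs` and its parameters are hypothetical, and only a single include and a single exclude pattern are modelled.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def plan_outputs(template_paths, output_dir, include="*", exclude=None):
    """Map each selected template to its output path, dropping '.ftl'."""
    plan = {}
    for path in template_paths:
        name = PurePosixPath(path).name
        if not fnmatchcase(name, include):
            continue  # not selected by the include pattern
        if exclude and fnmatchcase(name, exclude):
            continue  # explicitly excluded
        target = name[: -len(".ftl")] if name.endswith(".ftl") else name
        plan[path] = str(PurePosixPath(output_dir) / target)
    return plan


templates = [
    "ext/config/nginx.conf.ftl",
    "ext/config/README.md",
    "ext/config/app.yaml.ftl",
]
print(plan_outputs(templates, "/config", include="*.ftl", exclude="app.*"))
```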

### Command Line Options
```
--input-encoding : Encoding of the documents
--output-encoding : Encoding of the rendered template
--template-encoding : Encoding of the template
--output : Output file or directory
--include-document : Include pattern for documents
--exclude-document : Exclude pattern for documents
--include-template : Include pattern for templates
--exclude-template : Exclude pattern for templates
```

### Command Line Examples
```text
# Copy all FTL templates found in "ext/config" to the "/config" directory
# using the data from "config.json"

freemarker-cli -t ./ext/config --include-template *.ftl --o /config config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config config.json

# Basically the same using a named document "configuration"
# It might make sense to expose "conf" directly in the FreeMarker data model
# It might make sense to allow URIs for loading documents

freemarker-cli -t ./ext/config/*.ftl -o /config -d configuration=config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=file:///config.json

# Basically the same using an environment variable as named document
freemarker-cli -t ./ext/config --include-template *.ftl -o /config -d configuration=env:///CONFIGURATION
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=env:///CONFIGURATION
```
=== END

On 23.02.2020, at 16:37, Daniel Dekany <[email protected]>
wrote:
Input documents is a fundamental concept in freemarker-generator, so we
should think about that more, and probably refine/rework how it's done.

Currently it works like this, with CLI at least.

freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
Then in access-report.ftl you have to do something like this:
<#assign doc = Documents.get(0)>
... process doc here
(The more idiomatic Documents[0] won't work. Actually, that led to a
funny chain of coincidences: It returned the string "D", then
CSVTool.parse(...) happily parsed that to a table with the single column
"D", and 0 rows, and as there were 0 rows, the template didn't run into
an error because row.myExpectedColumn refers to a missing column either,
so the process finished with success. (: Pretty unlucky for sure. The
root was unintentionally breaking a FreeMarker idiom though; eventually
we will have to work on those too, but, different topic.)

However, actually multiple input documents can be passed in:

freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
somewhere/bar-access-log.csv
Above template will still work, though then you ignored all but the
first document. So if you expect any number of input documents, you
probably will have to do this:

<#list Documents.list as doc>
... process doc here
</#list>

(The more idiomatic <#list Documents as doc> won't work; but again,
those we will work out in a different thread.)
So, what would be better, in my opinion. I start out from what I think
are the common use cases, in decreasing order of frequency. Goal is to
make those less error prone for the users, and simpler to express.
USE CASE 1
You have exactly 1 input document, which is therefore simply "the"
document in the mind of the user. This is probably the typical use case,
but at least the use case users typically start out from when starting
the work.

freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
Then `Documents.get(0)` is not very fitting. Most importantly it's error
prone, because if the user passed in more than 1 document (can even
happen totally accidentally, like if the user was lazy and used a
wildcard that the shell expanded), the template will silently ignore the
rest of the documents, and the single document processed will be
practically picked randomly. The user might not notice that and submit a
bad report or such.

I think that in this use case the document should be simply referred to
as `Document` in the template. When you have multiple documents there,
referring to `Document` should be an error, saying that the template was
made to process a single document only.


USE CASE 2
You have multiple input documents, but each has a different role
(different schema, maybe different file type). Like, you pass in
users.csv and groups.csv. Each has a different schema, and so you want
to access them differently, but in the same template.

freemarker-cli
[...]
--named-document users somewhere/foo-users.csv
--named-document groups somewhere/foo-groups.csv

Then in the template you could refer to them as:
`NamedDocuments.users`, and `NamedDocuments.groups`.
Use Case 1, and 2 can be unified into a coherent concept, where
`Document` is just a shorthand for `NamedDocuments.main`. It's called
"main" because that's "the" document the template is about, but then you
have added some helper documents, with symbolic names representing
their role.
freemarker-cli
-t access-report.ftl
--document-name=main somewhere/foo-access-log.csv
--document-name=users somewhere/foo-users.csv
--document-name=groups somewhere/foo-groups.csv
Here, `Document` still works in the template, and it refers to
`somewhere/foo-access-log.csv`. (While omitting --document-name=main
above would be cleaner, I couldn't figure out how to do that with
Picocli. Anyway, for now the point is the concept, which is not specific
to CLI.)

USE CASE 3
Here you have several of the same kind of documents. That has a more
generic sub-use-case, when you have explicitly named documents (like
"users" above), and for some you expect multiple input files.
freemarker-cli
-t access-report.ftl
--document-name=main somewhere/foo-access-log.csv
somewhere/bar-access-log.csv
--document-name=users somewhere/foo-users.csv
somewhere/bar-users.csv
--document-name=groups somewhere/global-groups.csv
The template must be written with this use case in mind, as now it has
to #list some of the documents. (I think in practice you hardly ever
want to get a document by hard coded index. Either you don't know how
many documents you have, so you can't use hard coded indexes, or you do,
and each index has a specific meaning, but then you should name the
documents instead, as using indexes is error prone, and hard to read.)
Accessing that list of documents in the template, maybe could be done
like this:
- For the "main" documents: `DocumentList`
- For explicitly named documents, like "users":
`NamedDocumentLists.users`

SUMMING UP

To unify all 3 use cases into a coherent concept:
- `NamedDocumentLists.<name>` is the most generic form, and while you
can achieve everything with it, using it requires your template to
handle the most generic case too. So, I think it would be rarely used.
- `DocumentList` is just a shorthand for `NamedDocumentLists.main`. It's
used if you only have one kind of documents (single format and schema),
but potentially multiple of them.
- `NamedDocuments.<name>` expresses that you expect exactly 1 document
of the given name.
- `Document` is just a shorthand for `NamedDocuments.main`. This is for
the most natural/frequent use case.
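
How the four accessors relate can be sketched with a small class. This is a Python model of the proposal, not FreeMarker or freemarker-generator code; `DocumentAccess` and its method names are hypothetical stand-ins for `NamedDocumentLists`, `NamedDocuments`, `DocumentList` and `Document`.

```python
class DocumentAccess:
    """Models the four proposed ways of getting at input documents."""

    def __init__(self, named_lists):
        self._lists = {name: list(docs) for name, docs in named_lists.items()}

    def named_document_list(self, name):   # NamedDocumentLists.<name>
        return self._lists.get(name, [])

    def named_document(self, name):        # NamedDocuments.<name>
        docs = self.named_document_list(name)
        if len(docs) != 1:
            raise ValueError(
                f"expected exactly 1 document named {name!r}, got {len(docs)}"
            )
        return docs[0]

    @property
    def document_list(self):               # DocumentList
        return self.named_document_list("main")

    @property
    def document(self):                    # Document
        return self.named_document("main")


access = DocumentAccess({
    "main": ["foo-access-log.csv", "bar-access-log.csv"],
    "groups": ["global-groups.csv"],
})
print(access.named_document("groups"))  # one document, so this succeeds
print(access.document_list)             # both "main" documents
```

With two "main" documents, `access.document` raises instead of silently picking one, which is the error the proposal asks for.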
That's 4 possible ways of accessing your documents, which is a trade-off
for the sake of these:
- Catching CLI (or Maven, etc.) input where the template output likely
will be wrong. That's only possible if the user can communicate their
intent in the template.
- Users don't need to deal with concepts that are irrelevant in their
concrete use case. Just start with the trivial, `Document`, and later if
the need arises, generalize to named documents, document lists, or both.

What do you guys think?
--
Best regards,
Daniel Dekany
