Hi Daniel,
See my comments below - and thanks for your patience and input :-)
__Renaming Document To DataSource__
Yes, makes sense. I tried to avoid it since I'm already using
`javax.activation` and its `DataSource`.
__Template And Document Mode__
Agreed - I think it is a valuable abstraction for the user but it is not
an implementation concept :-)
__Document Without Symbolic Names__
Also agreed, and it is going to change, but I have not yet made up my
mind about what exactly to implement.
Thanks in advance,
Siegfried Goeschl
On 28 Feb 2020, at 1:05, Daniel Dekany wrote:
A few quick thoughts on that:
- We should replace the "document" term with something more descriptive.
It doesn't tell that it's some kind of input. Also, most of these inputs
aren't something that people typically call documents, like a CSV file,
or a database table, which is not even a file (OK, we don't support such
a thing at the moment). I think "data source" is maybe a safe enough
term. (It also rhymes with data model.)
- You have separate "template" and "document" "modes", each applying to
a whole run. I think such specialization won't be helpful. We could just
say, on the conceptual level at least, that we need a set of "output
generators". An output generator is an object (in the API) that
specifies a template, a data-model (where the data-model is possibly
populated with "documents"), and an output "sink" (a file path, or
stdout), and can generate the output itself. A practical way of defining
the output generators in a CLI application is via a bunch of files, each
defining an output generator. Some of those files may be templates
(which you can even detect from the file extension), others data files
that we currently call "documents". They could freely mix inside the
same run. I have also met a use case where you have a single table
(single "document"), and each record in it yields an output file. That
can also be described in some file format, or really in any other way,
like directly in command line arguments, via the API, etc.
- You have multiple documents without an associated symbolic name in
some examples. Templates can't identify those in a maintainable way
then. The actual file name is often not a good identifier: it can change
over time, and you might not even have good control over it, like when
you receive it as a parameter from somewhere else, or someone
moves/renames the files that you need to read. An index is not very good
either, but I have written about that earlier.
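
To make the "output generator" idea above a bit more concrete, here is a minimal Java sketch. All names in it (`OutputGenerator`, `TemplateRenderer`, `OutputGeneratorDemo`) are hypothetical and not part of freemarker-generator; the renderer interface is only a stand-in for whatever would eventually call the template engine, and data sources are reduced to plain strings.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical stand-in for the template engine call (not a FreeMarker API). */
interface TemplateRenderer {
    void render(String templateName, Map<String, Object> dataModel, Writer sink) throws IOException;
}

/** Hypothetical "output generator": a template, a data model, and an output sink. */
final class OutputGenerator {
    private final String templateName;
    private final Map<String, Object> dataModel;
    private final Writer sink;

    OutputGenerator(String templateName, Map<String, Object> dataModel, Writer sink) {
        this.templateName = templateName;
        this.dataModel = Map.copyOf(dataModel); // defensive, immutable copy
        this.sink = sink;
    }

    /** Generates the output; the generator carries everything needed to do so. */
    void generate(TemplateRenderer renderer) {
        try {
            renderer.render(templateName, dataModel, sink);
            sink.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class OutputGeneratorDemo {
    /** Runs two independent generators in the same run, each with its own sink. */
    static List<String> runDemo() {
        // Toy renderer: writes the template name and the sorted data model keys.
        TemplateRenderer toy = (name, model, out) ->
                out.write(name + ":" + new TreeSet<>(model.keySet()));
        StringWriter first = new StringWriter();
        StringWriter second = new StringWriter();
        List<OutputGenerator> generators = List.of(
                new OutputGenerator("report.ftl", Map.of("users", "users.csv"), first),
                new OutputGenerator("summary.ftl",
                        Map.of("groups", "groups.csv", "users", "users.csv"), second));
        for (OutputGenerator generator : generators) {
            generator.generate(toy);
        }
        return List.of(first.toString(), second.toString());
    }
}
```

In this sketch a "run" is nothing more than a list of generators, so a separate template mode and document mode is not needed: each generator decides its own template, data, and sink.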
On Wed, Feb 26, 2020 at 9:33 AM Siegfried Goeschl <siegfried.goes...@gmail.com> wrote:
Hi folks,
still wrapping my head around it, but I assembled some thoughts here -
https://gist.github.com/sgoeschl/b09b343a761b31a6c790d882167ff449
Thanks in advance,
Siegfried Goeschl
On 23 Feb 2020, at 23:14, Daniel Dekany <ddek...@apache.org> wrote:
What you are describing is more like the angle that FMPP took initially,
where templates drive things, and they generate the output for
themselves (even multiple output files if they wish). By default the
output file name (and relative path) is deduced from the template name.
There was also a global data-model, built in a configuration file (or
equally, built via command line arguments, or both mixed), from which
templates get whatever data they are interested in. Take a look at the
figures here: http://fmpp.sourceforge.net/qtour.html. Later, this
concept was generalized a bit more, because you could add XML files at
the same place where you have the templates, and then you could
associate transform templates to the XML files (based on path pattern
and/or the XML document element). Now that's like what
freemarker-generator had initially (data files drive output, and the
template is there to transform it).
So I think the generic mental model would look like this:
1. You have files that drive the process; let's call them *generator
files* for now. Usually, each generator file yields an output file (but
maybe even multiple output files, as you might have seen in the last
figure). These generator files can be of many types, like XML, JSON,
XLSX (as in the original freemarker-generator), and even templates (as
is the norm in FMPP). If the file is not a template, then you have a set
of transformer templates (-t CLI option) in a separate directory, which
can be associated with the generator files based on name patterns, and
even based on content (schema usually). If the generator file is a
template (so that's a positional @Parameter CLI argument that happens to
be an *.ftl, and is not a template file specified after the "-t"
option), then you just Template.process(...) it, and it prints what the
output will be.
2. You also have a set of variables, the global data-model, that
contains commonly useful stuff, like what you now call parameters (CLI
-Pname=value), but also maybe data loaded from JSON, XML, etc. Those
data files aren't "generator files". Templates just use them if they
need them. An important thing here is to reuse the same mechanism to
read and parse those data files that was used in templates when
transforming generator files. So we need a common format for specifying
how to load data files. That's maybe just FTL that #assigns to the
variables, or maybe a more declarative format.
What I have described in the original post here was a less generic form
of this, as I tried to stay true to the original approach. I thought the
proposal would be drastic enough as it is... :) There, the "main"
document is the "generator file" from point 1, the "-t" template is the
transform template for the "main" document, and the other named
documents ("users", "groups") are a poor man's shared data-model from
point 2 (together with -PName=value).
There's a further, somewhat confusing thing to get right with the
list-of-documents (`DocumentList`, `NamedDocumentLists`) thing though.
In the model above, as per point 1, if you list multiple data files,
each will generate a separate output file. So, if you need to take in a
list of files to transform them into a single output file (or at least
with a single transform template execution), then you have to be
explicit about that, as that's not the default behavior anymore. But
it's still absolutely possible. Imagine that a "list of XLSX-es" is
itself like a file format. You need some CLI (and Maven config, etc.)
syntax to express that, but that shouldn't be a big deal.
On Sun, Feb 23, 2020 at 9:43 PM Siegfried Goeschl <siegfried.goes...@gmail.com> wrote:
Hi Daniel,
Good timing - I was looking at a similar problem from a different angle
yesterday (see below).
I don't have enough time to answer your email in detail now - will do
that tomorrow evening.
Thanks in advance,
Siegfried Goeschl
=== START
# FreeMarker CLI Improvement
## Support Of Multiple Template Files
Currently we support the following combinations
* Single template and no data files
* Single template and one or more data files
But we cannot support the following use case, which is quite typical in
the cloud:
__Convert multiple templates with a single data file, e.g. copying a
directory of configuration files using a JSON configuration file__
## Implementation notes
* When we copy a directory we can remove the `ftl` extension on the fly
* We might need an `exclude` filter for the copy operation
* Initially resolve to a list of template files and process one after another
* Need to calculate the output file location and extension
* We need to rename the existing command line parameters (see below)
* Do we need multiple include and exclude filters?
* Do we need file versus directory filters?
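
A possible sketch of the "calculate the output file location and extension" note, in Java. The class and method names are hypothetical, and the assumed rule (keep the path relative to the template directory, drop a trailing `.ftl`) is only one reading of the notes above, not a settled design.

```java
import java.nio.file.Path;

/** Hypothetical helper; not part of freemarker-cli. */
final class OutputPaths {
    private static final String TEMPLATE_SUFFIX = ".ftl";

    /**
     * Maps a template found under templateDir to its output file under outputDir,
     * keeping the relative directory structure and dropping a trailing ".ftl".
     * Both templateDir and template must be of the same kind (both relative or
     * both absolute), as required by Path.relativize.
     */
    static Path resolveOutput(Path templateDir, Path template, Path outputDir) {
        Path relative = templateDir.relativize(template);
        String fileName = relative.getFileName().toString();
        if (fileName.endsWith(TEMPLATE_SUFFIX) && fileName.length() > TEMPLATE_SUFFIX.length()) {
            fileName = fileName.substring(0, fileName.length() - TEMPLATE_SUFFIX.length());
        }
        // Replace only the last path element, then re-root under the output directory.
        return outputDir.resolve(relative.resolveSibling(fileName));
    }
}
```

With these assumptions, `ext/config/nginx/nginx.conf.ftl` with template directory `ext/config` and output directory `/config` would map to `/config/nginx/nginx.conf`; files without the `.ftl` suffix keep their name.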
### Command Line Options
```
--input-encoding : Encoding of the documents
--output-encoding : Encoding of the rendered template
--template-encoding : Encoding of the template
--output : Output file or directory
--include-document : Include pattern for documents
--exclude-document : Exclude pattern for documents
--include-template : Include pattern for templates
--exclude-template : Exclude pattern for templates
```
### Command Line Examples
```text
# Copy all FTL templates found in "ext/config" to the "/config" directory
# using the data from "config.json"
freemarker-cli -t ./ext/config --include-template *.ftl -o /config config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config config.json
# Basically the same using a named document "configuration"
# It might make sense to expose "conf" directly in the FreeMarker data model
# It might make sense to allow URIs for loading documents
freemarker-cli -t ./ext/config/*.ftl -o /config -d configuration=config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=file:///config.json
# Basically the same using an environment variable as named document
freemarker-cli -t ./ext/config --include-template *.ftl -o /config -d configuration=env:///CONFIGURATION
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=env:///CONFIGURATION
```
=== END
On 23.02.2020, at 16:37, Daniel Dekany <ddek...@apache.org> wrote:
Input documents are a fundamental concept in freemarker-generator, so we
should think about that more, and probably refine/rework how it's done.
Currently it works like this, with the CLI at least.
freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
Then in access-report.ftl you have to do something like this:
<#assign doc = Documents.get(0)>
... process doc here
(The more idiomatic Documents[0] won't work. Actually, that led to a
funny chain of coincidences: it returned the string "D", then
CSVTool.parse(...) happily parsed that to a table with the single column
"D" and 0 rows, and as there were 0 rows, the template didn't run into
an error for row.myExpectedColumn referring to a missing column either,
so the process finished with success. (: Pretty unlucky for sure. The
root cause was unintentionally breaking a FreeMarker idiom though;
eventually we will have to work on those too, but that's a different
topic.)
However, actually multiple input documents can be passed in:
freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
somewhere/bar-access-log.csv
The above template will still work, though it then ignores all but the
first document. So if you expect any number of input documents, you will
probably have to do this:
<#list Documents.list as doc>
... process doc here
</#list>
(The more idiomatic <#list Documents as doc> won't work; but again, we
will work those out in a different thread.)
So, here is what would be better, in my opinion. I start out from what I
think are the common use cases, in decreasing order of frequency. The
goal is to make those less error prone for the users, and simpler to
express.
USE CASE 1
You have exactly 1 input document, which is therefore simply "the"
document in the mind of the user. This is probably the typical use case,
or at least the use case users typically start out from when starting
the work.
freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
Then `Documents.get(0)` is not very fitting. Most importantly it's error
prone, because if the user passed in more than 1 document (which can
even happen totally accidentally, like if the user was lazy and used a
wildcard that the shell expanded), the template will silently ignore the
rest of the documents, and the single document processed will be picked
practically at random. The user might not notice that, and submit a bad
report or such.
I think that in this use case the document should be simply referred to
as `Document` in the template. When you have multiple documents there,
referring to `Document` should be an error, saying that the template was
made to process a single document only.
USE CASE 2
You have multiple input documents, but each has a different role
(different schema, maybe different file type). Like, you pass in
users.csv and groups.csv. Each has a different schema, and so you want
to access them differently, but in the same template.
freemarker-cli
[...]
--named-document users somewhere/foo-users.csv
--named-document groups somewhere/foo-groups.csv
Then in the template you could refer to them as `NamedDocuments.users`
and `NamedDocuments.groups`.
Use cases 1 and 2 can be unified into a coherent concept, where
`Document` is just a shorthand for `NamedDocuments.main`. It's called
"main" because that's "the" document the template is about, but then you
had to add some helper documents, with symbolic names representing their
role.
freemarker-cli
-t access-report.ftl
--document-name=main somewhere/foo-access-log.csv
--document-name=users somewhere/foo-users.csv
--document-name=groups somewhere/foo-groups.csv
Here, `Document` still works in the template, and it refers to
`somewhere/foo-access-log.csv`. (While omitting --document-name=main
above would be cleaner, I couldn't figure out how to do that with
Picocli. Anyway, for now the point is the concept, which is not specific
to the CLI.)
USE CASE 3
Here you have several documents of the same kind. That has a more
generic sub-use-case, when you have explicitly named documents (like
"users" above), and for some you expect multiple input files.
freemarker-cli
-t access-report.ftl
--document-name=main somewhere/foo-access-log.csv
somewhere/bar-access-log.csv
--document-name=users somewhere/foo-users.csv
somewhere/bar-users.csv
--document-name=groups somewhere/global-groups.csv
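
As an illustration only, the grouping implied by the command above could be parsed roughly as in the Java sketch below. It does not use Picocli; the class name is made up, and treating files that appear before any `--document-name=` as belonging to "main" is an assumption, not something settled in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical argument grouping; not how freemarker-cli parses its options. */
final class NamedDocumentArgs {
    private static final String OPTION = "--document-name=";

    /**
     * Groups each file under the most recent --document-name=<name>.
     * Assumption: files listed before any name are filed under "main".
     */
    static Map<String, List<String>> group(List<String> args) {
        Map<String, List<String>> filesByName = new LinkedHashMap<>();
        String currentName = "main";
        for (String arg : args) {
            if (arg.startsWith(OPTION)) {
                currentName = arg.substring(OPTION.length());
                if (currentName.isEmpty()) {
                    throw new IllegalArgumentException("Empty document name in: " + arg);
                }
            } else {
                filesByName.computeIfAbsent(currentName, name -> new ArrayList<>()).add(arg);
            }
        }
        return filesByName;
    }
}
```

The result maps every symbolic name to a list of files, which is the most generic shape; names that end up with exactly one file can still be exposed as single documents.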
The template must be written with this use case in mind, as now it has
to #list some of the documents. (I think in practice you hardly ever
want to get a document by a hard-coded index. Either you don't know how
many documents you have, so you can't use hard-coded indexes, or you do,
and each index has a specific meaning, but then you should name the
documents instead, as using indexes is error prone and hard to read.)
Accessing that list of documents in the template could maybe be done
like this:
- For the "main" documents: `DocumentList`
- For explicitly named documents, like "users": `NamedDocumentLists.users`
SUMMING UP
To unify all 3 use cases into a coherent concept:
- `NamedDocumentLists.<name>` is the most generic form, and while you
can achieve everything with it, using it requires your template to
handle the most generic case too. So, I think it would be rarely used.
- `DocumentList` is just a shorthand for `NamedDocumentLists.main`. It's
used if you only have one kind of documents (single format and schema),
but potentially multiple of them.
- `NamedDocuments.<name>` expresses that you expect exactly 1 document
of the given name.
- `Document` is just a shorthand for `NamedDocuments.main`. This is for
the most natural/frequent use case.
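
The four access paths listed above could be backed by a single map of named lists. The Java sketch below is only an illustration of that idea: `DocumentModel` and its method names are hypothetical, documents are reduced to plain file names, and the error message is a placeholder.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical backing model for the four proposed template variables. */
final class DocumentModel {
    static final String MAIN = "main";
    private final Map<String, List<String>> listsByName;

    DocumentModel(Map<String, List<String>> listsByName) {
        this.listsByName = Map.copyOf(listsByName);
    }

    /** NamedDocumentLists.<name>: the most generic form, zero or more documents. */
    List<String> namedDocumentList(String name) {
        return listsByName.getOrDefault(name, List.of());
    }

    /** DocumentList: shorthand for NamedDocumentLists.main. */
    List<String> documentList() {
        return namedDocumentList(MAIN);
    }

    /** NamedDocuments.<name>: the template states that it expects exactly one document. */
    String namedDocument(String name) {
        List<String> documents = namedDocumentList(name);
        if (documents.size() != 1) {
            throw new IllegalStateException("Expected exactly 1 document named \"" + name
                    + "\", but got " + documents.size());
        }
        return documents.get(0);
    }

    /** Document: shorthand for NamedDocuments.main. */
    String document() {
        return namedDocument(MAIN);
    }
}
```

The point of the sketch is that the single-document accessors fail loudly instead of silently picking one of several inputs, so the choice of accessor in the template carries the user's intent.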
That's 4 possible ways of accessing your documents, which is a trade-off
for the sake of these:
- Catching CLI (or Maven, etc.) input where the template output will
likely be wrong. That's only possible if the user can communicate their
intent in the template.
- Users don't need to deal with concepts that are irrelevant in their
concrete use case. Just start with the trivial, `Document`, and later,
if the need arises, generalize to named documents, document lists, or
both.
What do you guys think?