Re: freemarker-generator: Improving the input documents concept

Siegfried Goeschl Fri, 28 Feb 2020 02:09:23 -0800

Hi Daniel,

That all depends on your mental model, the work you do, your expectations and your experience :-)



__Document Handling__

*"But I think actually we have no good use case for list of documents that's passed at once to a single template run, so, we can just ignore that complication"*

In my case that's not a complication but my daily business - I'm regularly wading through access logs - yesterday probably a couple of hundred access logs across two staging sites to help track down some strange API gateway issues :-)

My gut feeling is (borrowing from https://medium.com/@Joachim8675309/devops-concepts-pets-vs-cattle-2380b5aab313)


1. You have a few lovely named documents / templates - `pets`

2. You have tons of anonymous documents / templates to process - `cattle`

3. The "grey area" comes into play when mixing `pets & cattle`

`freemarker-cli` was built with 2) in mind and I want to cover 1) since it is equally important and common.



__Template And Document Processing Modes__

IMHO it is important to answer the following question: "How many outputs do you get when rendering 2 templates and 3 data sources? Two, three or six?"


Your answer is influenced by your mental model / experience

* When wading through tons of CSV files, access logs, etc. the answer is "2"

* When doing source code generation the obvious answer is "6"

* Can't imagine a use case which results in "3" but I'm pretty sure we will encounter one
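
To make the three answers concrete, here is a minimal Python sketch that counts the outputs under three processing modes. The mode names (`aggregate`, `per_datasource`, `cross_product`) and the file names are purely illustrative assumptions, not existing `freemarker-cli` options:

```python
from itertools import product


def plan_outputs(templates, datasources, mode):
    """Return one (template, datasources) pair per output that would be rendered."""
    if mode == "aggregate":
        # Every template sees all data sources at once: one output per template.
        return [(t, tuple(datasources)) for t in templates]
    if mode == "per_datasource":
        # One output per data source; how a template gets picked is left open
        # (hypothetical placeholder: no template is chosen here).
        return [(None, (d,)) for d in datasources]
    if mode == "cross_product":
        # Every template is rendered once for every data source.
        return [(t, (d,)) for t, d in product(templates, datasources)]
    raise ValueError(f"unknown mode: {mode}")


templates = ["report.ftl", "summary.ftl"]
datasources = ["a.csv", "b.csv", "c.csv"]

for mode in ("aggregate", "per_datasource", "cross_product"):
    print(mode, len(plan_outputs(templates, datasources, mode)))
```

The log-crunching case corresponds to `aggregate` (2 outputs), source code generation to `cross_product` (6 outputs), and the so far hypothetical "3" would be `per_datasource`.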


__Template and document mode probably shouldn't exist__

That's hard for me to fully understand - I definitely lack your insights & experience writing such tools :-)

Defining the `Output Generator` is the underlying model for the Maven plugin (and probably FMPP).

I'm not sure if this applies to command lines, at least not in the way I use them (or would like to use them).



Thanks in advance,

Siegfried Goeschl

PS: Can/shall I merge the PR to bring in `freemarker-cli`?


On 28 Feb 2020, at 9:14, Daniel Dekany wrote:

Yeah, "data source" is surely a too popular name, but for a reason. Anyone has other ideas?

As for naming data sources and such: one thing I was wondering about back then is how to deal with list of documents given to a template, versus exactly 1 document given to a template. But I think actually we have no good use case for list of documents that's passed at once to a single template run, so, we can just ignore that complication. A document has a name, and that's always just a single document, not a collection, as far as the template is concerned. (We can have multiple documents per run, but those normally yield separate output generators, so it's still only one document per template.) However, we can have data source types (document types with old terminology) that collect together multiple data files. So then that complexity is encapsulated into the data source type, and doesn't complicate the overall architecture. That's another case when a data source is not just a file. Like maybe there's a data source type that loads all the CSV-s from a directory into a single big table (I had such a case), or even into a list of tables. Or, as I mentioned already, a data source is maybe an SQL query on a JDBC data source (and we got the first term clash... JDBC also calls them data sources).

Template and document mode probably shouldn't exist from the user perspective either, at least not as a global option that must apply to everything in a run. They could just give the files that define the "output generators", and some of them will be templates, some of them are data files, in which case a template needs to be associated with them (and there can be a couple of ways of doing that). And then again, there are the cases where you want to create one output generator per entity from some data source.

On Fri, Feb 28, 2020 at 8:23 AM Siegfried Goeschl <
[email protected]> wrote:
Hi Daniel,

See my comments below - and thanks for your patience and input :-)

*Renaming Document To DataSource*
Yes, makes sense. I tried to avoid it since I'm using javax.activation and its DataSource.

*Template And Document Mode*
Agreed - I think it is a valuable abstraction for the user but it is not an implementation concept :-)

*Document Without Symbolic Names*
Also agreed and it is going to change, but I have not made up my mind yet about what exactly to implement.

Thanks in advance,

Siegfried Goeschl

On 28 Feb 2020, at 1:05, Daniel Dekany wrote:

A few quick thoughts on that:
- We should replace the "document" term with something more descriptive. It doesn't tell that it's some kind of input. Also, most of these inputs aren't something that people typically call documents. Like a CSV file, or a database table, which is not even a file (OK, we don't support such a thing at the moment). I think maybe "data source" is a safe enough term. (It also rhymes with data model.)
- You have separate "template" and "document" "mode", that applies to a whole run. I think such specialization won't be helpful. We could just say, on the conceptual level at least, that we need a set of "output generators". An output generator is an object (in the API) that specifies a template, a data-model (where the data-model is possibly populated with "documents"), and an output "sink" (a file path, or stdout), and can generate the output itself. A practical way of defining the output generators in a CLI application is via a bunch of files, each defining an output generator. Some of those files are maybe templates (that you can even detect from the file extension), or data files that we currently call "documents". They could freely mix inside the same run. I have also met a use case where you have a single table (single "document"), and each record in it yields an output file. That can also be described in some file format, or really in any other way, like directly in a command line argument, via API, etc.
- You have multiple documents without an associated symbolic name in some examples. Templates can't identify those then in a well maintainable way. The actual file name is often not a good identifier: it can change over time, and you might not even have good control over it, like you already receive it as a parameter from somewhere else, or someone moves/renames the files that you need to read. Index is also not very good, but I have written about that earlier.
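
The "output generator" idea from the second point could be sketched roughly as below. This is only an illustration in Python under stated assumptions: the class and field names are hypothetical, and `string.Template` stands in for a real FreeMarker template.

```python
import io
import sys
from dataclasses import dataclass, field
from string import Template
from typing import Any, Dict, Optional, TextIO


@dataclass
class OutputGenerator:
    """Hypothetical model: one template + one data-model + one output sink."""

    template: str  # template source; string.Template is a stand-in for FTL
    data_model: Dict[str, Any] = field(default_factory=dict)
    sink: Optional[TextIO] = None  # None means stdout

    def generate(self) -> None:
        out = self.sink if self.sink is not None else sys.stdout
        out.write(Template(self.template).substitute(self.data_model))


# A single table (single "document") where each record yields its own output.
records = [{"name": "alice"}, {"name": "bob"}]
sinks = [io.StringIO() for _ in records]
generators = [
    OutputGenerator("Hello $name\n", record, sink)
    for record, sink in zip(records, sinks)
]
for generator in generators:
    generator.generate()

print([sink.getvalue() for sink in sinks])
```

With this shape there is no global "template mode" or "document mode": a run is just a list of such generators, however they were defined.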


On Wed, Feb 26, 2020 at 9:33 AM Siegfried Goeschl <
[email protected]> wrote:

Hi folks,

still wrapping my mind around it but assembled some thoughts here -
https://gist.github.com/sgoeschl/b09b343a761b31a6c790d882167ff449

Thanks in advance,

Siegfried Goeschl


On 23 Feb 2020, at 23:14, Daniel Dekany <[email protected]> wrote:
What you are describing is more like the angle that FMPP took initially, where templates drive things, they generate the output for themselves (even multiple output files if they wish). By default the output file's name (and relative path) is deduced from the template name. There was also a global data-model, built in a configuration file (or equally, built via command line arguments, or both mixed), from which templates get whatever data they are interested in. Take a look at the figures here: http://fmpp.sourceforge.net/qtour.html. Later, this concept was generalized a bit more, because you could add XML files at the same place where you have the templates, and then you could associate transform templates to the XML files (based on path pattern and/or the XML document element). Now that's like what freemarker-generator had initially (data files drive output, and the template is there to transform it).

So I think the generic mental model would look like this:

1. You got files that drive the process, let's call them *generator files* for now. Usually, each generator file yields an output file (but maybe even multiple output files, as you might have seen in the last figure). These generator files can be of many types, like XML, JSON, XLSX (as in the original freemarker-generator), and even templates (as is the norm in FMPP). If the file is not a template, then you got a set of transformer templates (-t CLI option) in a separate directory, which can be associated with the generator files based on name patterns, and even based on content (schema usually). If the generator file is a template (so that's a positional @Parameter CLI argument that happens to be an *.ftl, and is not a template file specified after the "-t" option), then you just Template.process(...) it, and it prints what the output will be.
2. You also have a set of variables, the global data-model, that contains commonly useful stuff, like what you now call parameters (CLI -Pname=value), but also maybe data loaded from JSON, XML, etc. Those data files aren't "generator files". Templates just use them if they need them. An important thing here is to reuse the same mechanism to read and parse those data files, which was used in templates when transforming generator files. So we need a common format for specifying how to load data files. That's maybe just FTL that #assigns to the variables, or maybe a more declarative format.
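
The association step in point 1 can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the pattern table, the file names, and the rule that the first matching pattern wins are not taken from FMPP or freemarker-generator.

```python
from fnmatch import fnmatch

# Hypothetical association rules: file name pattern -> transformer template.
TRANSFORMERS = [
    ("*-access-log.csv", "access-report.ftl"),
    ("*.json", "config.ftl"),
]


def resolve(generator_file):
    """Return (template, input_file) for one generator file.

    A template drives itself (input_file is None); a data file is paired
    with the first transformer template whose pattern matches its name.
    """
    if generator_file.endswith(".ftl"):
        return generator_file, None
    name = generator_file.rsplit("/", 1)[-1]
    for pattern, template in TRANSFORMERS:
        if fnmatch(name, pattern):
            return template, generator_file
    raise LookupError(f"no transformer template for {generator_file}")


for path in ["index.html.ftl", "logs/foo-access-log.csv", "settings.json"]:
    print(path, "->", resolve(path))
```

Each resolved pair would then yield one output file, which is the default behavior described above.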
What I have described in the original post here was a less generic form of this, as I tried to be true to the original approach. I thought the proposal would be drastic enough as it is... :) There, the "main" document is the "generator file" from point 1, the "-t" template is the transform template for the "main" document, and the other named documents ("users", "groups") are a poor man's shared data-model from point 2 (together with -PName=value).

There's a further somewhat confusing thing to get right with the list-of-documents (`DocumentList`, `NamedDocumentLists`) thing though. In the model above, as per point 1, if you list multiple data files, each will generate a separate output file. So, if you need to take in a list of files to transform it to a single output file (or at least with a single transform template execution), then you have to be explicit about that, as that's not the default behavior anymore. But it's still absolutely possible. Imagine it as if a "list of XLSX-es" is itself like a file format. You need some CLI (and Maven config, etc.) syntax to express that, but that shouldn't be a big deal.



On Sun, Feb 23, 2020 at 9:43 PM Siegfried Goeschl <
[email protected]> wrote:

Hi Daniel,

Good timing - I was looking at a similar problem from a different angle yesterday (see below).
Don't have enough time to answer your email in detail now - will do that tomorrow evening.

Thanks in advance,

Siegfried Goeschl


=== START
# FreeMarker CLI Improvement
## Support Of Multiple Template Files
Currently we support the following combinations

* Single template and no data files
* Single template and one or more data files

But we cannot support the following use case, which is quite typical in the cloud:

__Convert multiple templates with a single data file, e.g copying a
directory of configuration files using a JSON configuration file__

## Implementation notes
* When we copy a directory we can remove the `ftl` extension on the fly
* We might need an `exclude` filter for the copy operation
* Initially resolve to a list of template files and process one after
another
* Need to calculate the output file location and extension
* We need to rename the existing command line parameters (see below)
* Do we need multiple include and exclude filters?
* Do we need file versus directory filters?

### Command Line Options
```
--input-encoding : Encoding of the documents
--output-encoding : Encoding of the rendered template
--template-encoding : Encoding of the template
--output : Output file or directory
--include-document : Include pattern for documents
--exclude-document : Exclude pattern for documents
--include-template : Include pattern for templates
--exclude-template : Exclude pattern for templates
```

### Command Line Examples
```text
# Copy all FTL templates found in "ext/config" to the "/config" directory
# using the data from "config.json"
freemarker-cli -t ./ext/config --include-template *.ftl -o /config config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config config.json

# Basically the same using a named document "configuration"
# It might make sense to expose "conf" directly in the FreeMarker data model
# It might make sense to allow URIs for loading documents
freemarker-cli -t ./ext/config/*.ftl -o /config -d configuration=config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=config.json
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=file:///config.json

# Basically the same using an environment variable as named document
freemarker-cli -t ./ext/config --include-template *.ftl -o /config -d configuration=env:///CONFIGURATION
freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=env:///CONFIGURATION
```
=== END

On 23.02.2020, at 16:37, Daniel Dekany <[email protected]> wrote:
Input documents is a fundamental concept in freemarker-generator, so we should think about that more, and probably refine/rework how it's done.
Currently it works like this, with CLI at least.

freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv

Then in access-report.ftl you have to do something like this:

<#assign doc = Documents.get(0)>
... process doc here

(The more idiomatic Documents[0] won't work. Actually, that led to a funny chain of coincidences: It returned the string "D", then CSVTool.parse(...) happily parsed that to a table with the single column "D", and 0 rows, and as there were 0 rows, the template didn't run into an error because row.myExpectedColumn refers to a missing column either, so the process finished with success. (: Pretty unlucky for sure. The root was unintentionally breaking a FreeMarker idiom though; eventually we will have to work on those too, but, different topic.)

However, actually multiple input documents can be passed in:

freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv
somewhere/bar-access-log.csv

The above template will still work, though then you ignored all but the first document. So if you expect any number of input documents, you probably will have to do this:

<#list Documents.list as doc>
... process doc here
</#list>

(The more idiomatic <#list Documents as doc> won't work; but again, those we will work out in a different thread.)

So, what would be better, in my opinion. I start out from what I think are the common use cases, in decreasing order of frequency. Goal is to make those less error prone for the users, and simpler to express.

USE CASE 1

You have exactly 1 input document, which is therefore simply "the" document in the mind of the user. This is probably the typical use case, but at least the use case users typically start out from when starting the work.

freemarker-cli
-t access-report.ftl
somewhere/foo-access-log.csv

Then `Documents.get(0)` is not very fitting. Most importantly it's error prone, because if the user passed in more than 1 document (can even happen totally accidentally, like if the user was lazy and used a wildcard that the shell exploded), the template will silently ignore the rest of the documents, and the single document processed will be practically picked randomly. The user might not notice that and submits a bad report or such.

I think that in this use case the document should be simply referred to as `Document` in the template. When you have multiple documents there, referring to `Document` should be an error, saying that the template was made to process a single document only.


USE CASE 2

You have multiple input documents, but each has a different role (different schema, maybe different file type). Like, you pass in users.csv and groups.csv. Each has a different schema, and so you want to access them differently, but in the same template.

freemarker-cli
[...]
--named-document users somewhere/foo-users.csv
--named-document groups somewhere/foo-groups.csv

Then in the template you could refer to them as `NamedDocuments.users` and `NamedDocuments.groups`.

Use Case 1 and 2 can be unified into a coherent concept, where `Document` is just a shorthand for `NamedDocuments.main`. It's called "main" because that's "the" document the template is about, but then you have to add some helper documents, with symbolic names representing their role.

freemarker-cli
-t access-report.ftl
--document-name=main somewhere/foo-access-log.csv
--document-name=users somewhere/foo-users.csv
--document-name=groups somewhere/foo-groups.csv

Here, `Document` still works in the template, and it refers to `somewhere/foo-access-log.csv`. (While omitting --document-name=main above would be cleaner, I couldn't figure out how to do that with Picocli. Anyway, for now the point is the concept, which is not specific to CLI.)

USE CASE 3

Here you have several of the same kind of documents. That has a more
generic sub-use-case, when you have explicitly named documents (like
"users" above), and for some you expect multiple input files.

freemarker-cli
-t access-report.ftl
--document-name=main somewhere/foo-access-log.csv
somewhere/bar-access-log.csv
--document-name=users somewhere/foo-users.csv
somewhere/bar-users.csv
--document-name=groups somewhere/global-groups.csv

The template must be written with this use case in mind, as now it has to #list some of the documents. (I think in practice you hardly ever want to get a document by hard coded index. Either you don't know how many documents you have, so you can't use hard coded indexes, or you do, and each index has a specific meaning, but then you should name the documents instead, as using indexes is error prone, and hard to read.)

Accessing that list of documents in the template maybe could be done like this:
- For the "main" documents: `DocumentList`
- For explicitly named documents, like "users": `NamedDocumentLists.users`

SUMMING UP

To unify all 3 use cases into a coherent concept:
- `NamedDocumentLists.<name>` is the most generic form, and while you can achieve everything with it, using it requires your template to handle the most generic case too. So, I think it would be rarely used.
- `DocumentList` is just a shorthand for `NamedDocumentLists.main`. It's used if you only have one kind of documents (single format and schema), but potentially multiple of them.
- `NamedDocuments.<name>` expresses that you expect exactly 1 document of the given name.
- `Document` is just a shorthand for `NamedDocuments.main`. This is for the most natural/frequent use case.

That's 4 possible ways of accessing your documents, which is a trade-off for the sake of these:
- Catching CLI (or Maven, etc.) input where the template output likely will be wrong. That's only possible if the user can communicate its intent in the template.
- Users don't need to deal with concepts that are irrelevant in their concrete use case. Just start with the trivial, `Document`, and later if the need arises, generalize to named documents, document lists, or both.
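
To show how the four accessors relate, here is a rough model in Python. It is only a sketch of the proposal, not the freemarker-generator API: the class and method names are hypothetical, and plain file names stand in for real document objects.

```python
class Documents:
    """Hypothetical model of the four proposed accessors."""

    def __init__(self, named_lists):
        # name -> list of documents, e.g. {"main": [...], "users": [...]}
        self._lists = {name: list(docs) for name, docs in named_lists.items()}

    def named_document_list(self, name):
        # Most generic form: any number of documents under one name.
        return self._lists.get(name, [])

    def named_document(self, name):
        # States the intent "exactly one document of this name".
        docs = self.named_document_list(name)
        if len(docs) != 1:
            raise ValueError(
                f"expected exactly 1 document named {name!r}, got {len(docs)}"
            )
        return docs[0]

    @property
    def document_list(self):
        return self.named_document_list("main")  # shorthand for the "main" list

    @property
    def document(self):
        return self.named_document("main")  # shorthand for the single "main"


docs = Documents({
    "main": ["foo-access-log.csv", "bar-access-log.csv"],
    "groups": ["global-groups.csv"],
})
print(docs.document_list)
print(docs.named_document("groups"))
try:
    docs.document  # two "main" documents were passed, so this is an error
except ValueError as error:
    print("error:", error)
```

The single-document accessors fail loudly instead of silently picking one of several inputs, which is the whole point of letting the template state its intent.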

What do you guys think?
