I have a few opinions to offer.

First, what is this data profile for? I know of only a few things one would
use that sort of profile for (guessing natural language, guessing charset,
LF vs. CRLF line endings, text vs. binary data, binary integers vs.
packed-decimal integers, compressed vs. not), and those could all be
handled by looking at the first kilobyte of the data. If you randomly
select another kilobyte of the file and get a different profile result,
that's also interesting, because it tells you the data is not consistent
throughout the file. But all of this is just to guess a few top-level data
characteristics. I don't know of anything that requires a byte histogram
over an entire file, so I wouldn't even offer that operation, or I would
default it to 1 KB of data and let the user enlarge the range.
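
To make that concrete, here's a rough Scala sketch of the kind of
first-kilobyte guess I mean (just a text-vs-binary check; the 1 KiB sample
size, the NUL-byte test, and the 0.95 printable-ratio threshold are
arbitrary choices for illustration, not anything I'm claiming the editor
actually does):

    import java.nio.file.{Files, Paths}

    object QuickProfile {
      // Read up to the first 1 KiB and guess text vs. binary from it.
      def looksLikeText(path: String, sampleSize: Int = 1024): Boolean = {
        val in = Files.newInputStream(Paths.get(path))
        try {
          val buf = new Array[Byte](sampleSize)
          val n = in.read(buf)
          if (n <= 0) return true                       // empty file: call it text
          val sample = buf.take(n)
          if (sample.contains(0.toByte)) return false   // NUL bytes: call it binary
          val printable = sample.count { b =>
            val u = b & 0xff
            u == '\t' || u == '\n' || u == '\r' || (u >= 0x20 && u < 0x7f) || u >= 0x80
          }
          printable.toDouble / n > 0.95                 // arbitrary threshold
        } finally in.close()
      }
    }

The same sampled kilobyte can feed the charset, line-ending, and
compressed/not guesses; none of them need the whole file.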

I agree that you can't just treat binary data files like text editing. I
think Search and Search/Replace are very unlikely to be used in bulk on
binary data files. I deal mostly with binary data, and there I think there
are so many opportunities for false matches that bulk operations (replace
all, or search all to get a count) make little or no sense. Textual data is
a different story; that's just big-data text editing. But many data
formats, even mostly-textual ones, have stored length information. In those
cases, editing the data in any way that changes the number of bytes
immediately breaks the whole file, so insert/delete makes basically no
sense. I think an editor needs a mode where you can't accidentally insert
or delete data: the user can only overwrite bytes in place. Once users know
their data has stored lengths, they will want to toggle this on to avoid
corrupting the entire file by simple accident. Arguably, data editors
should default to this mode and require users to explicitly allow
insert/delete before those features are enabled. Once a user sets these
options, remembering them in some sort of sticky settings saved in the
project folders is helpful.
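
As a sketch of what I mean by that overwrite-only mode, a guard like this
in the edit path would do it (EditSession and applyEdit are names I made up
for illustration, not the extension's actual API):

    final case class Edit(offset: Long, deleteLen: Int, insert: Array[Byte])

    final class EditSession(var overwriteOnly: Boolean = true) {
      // Reject any edit that would change the file length while
      // overwrite-only mode is on; otherwise pass it through.
      def applyEdit(e: Edit): Either[String, Edit] =
        if (overwriteOnly && e.insert.length != e.deleteLen)
          Left(s"overwrite-only mode: edit would change length by ${e.insert.length - e.deleteLen} bytes")
        else
          Right(e) // hand off to the real edit engine
    }

The point is just that length-changing edits are refused by default and
have to be explicitly enabled by the user.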

For data files of any size, a total search that tells you how many matches
a pattern has is, at least for binary data, fairly meaningless, so I would
suggest that displaying both the first few matches and a total count is not
necessary. Just show the first match and prefetch the second (or maybe a
few more) so the user gets some context, but I wouldn't bother doing
anything beyond that. The total count is just not that useful for binary
data. For text data, maybe, but for binary data I don't see the use case.
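
In other words, the search result should be something lazy you pull one
match at a time from, roughly like this naive sketch (a byte-by-byte scan
over an in-memory array and a made-up file name, purely for illustration; a
real version would have to search the edited content, not just what's on
disk):

    import java.nio.file.{Files, Paths}

    // Lazily yield the offsets of every occurrence of pattern in data.
    def matchOffsets(data: Array[Byte], pattern: Array[Byte]): Iterator[Long] =
      Iterator.range(0, data.length - pattern.length + 1)
        .filter(i => data.startsWith(pattern, i))
        .map(_.toLong)

    // Show the first match and prefetch a couple more; never count them all.
    val data     = Files.readAllBytes(Paths.get("records.bin"))   // hypothetical file
    val pattern  = Array[Byte](0xDE.toByte, 0xAD.toByte, 0xBE.toByte, 0xEF.toByte)
    val firstFew = matchOffsets(data, pattern).take(3).toList

Nothing in that forces a full scan, so "find next" stays cheap no matter
how big the file is.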

Last thought. Often there is not one big data file, but a directory (or
several) full of smaller data files. Operations on data should be able to
span files or operate on a single file in a fairly transparent manner.
There's very little conceptual difference between a file of binary records
and a directory of files each containing one binary record, so a data
editing environment should treat them as roughly equivalent. Searching for
a particular byte or bit pattern in one file, or across a directory of
files, and moving from one match to the next should be more or less the
same operation to the user.
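
A sketch of what I mean by "same operation" (the directory handling and the
reuse of the matchOffsets sketch above are illustrative only, and the
directory stream is left unclosed for brevity):

    import java.nio.file.{Files, Path, Paths}
    import scala.jdk.CollectionConverters._

    // A search scope is either a single file or every regular file in a directory.
    def scopeFiles(target: Path): Seq[Path] =
      if (Files.isDirectory(target))
        Files.list(target).iterator().asScala
          .filter(Files.isRegularFile(_)).toSeq.sortBy(_.toString)
      else Seq(target)

    // Step through (file, offset) matches across the whole scope, one at a time.
    def matchesInScope(target: Path, findIn: Path => Iterator[Long]): Iterator[(Path, Long)] =
      scopeFiles(target).iterator.flatMap(p => findIn(p).map(off => (p, off)))

    // e.g. matchesInScope(Paths.get("data"), p => matchOffsets(Files.readAllBytes(p), pattern))

To the user, "next match" just steps across file boundaries the same way it
steps across record boundaries within one file.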

On Thu, Jul 27, 2023 at 9:41 AM Davin Shearer <da...@apache.org> wrote:

> In v1.3.1 we've added support for editing large files, but it has
> exposed some other challenges related to search, replace, and data
> profiling.  I outline the problems and possible solutions to these problems
> in a discussion thread here (
> https://github.com/ctc-oss/daffodil-vscode/discussions/122).
>
> The bottom line up front is that for search and replace, I think we'll need
> to adopt an interactive approach rather than an all at once approach.
> For example search will find the next match from where you are, click next
> and it will find the next, and so on, instead of finding all the matches up
> front.  Similarly, with replace, we find the next match, then you can
> either replace or skip to the next match, and so on.  These are departures
> from v1.3.0, but we need something that will scale.
>
> Data profiling is a new feature in v1.3.1 that creates a byte frequency
> graph and some statistics on all or part of the edited file.  Right now
> I've allowed it to profile from the beginning to the end of the file, even
> if the file is multiple gigabytes in size.  Currently though that could
> take longer than 5 seconds especially if the file has many editing
> changes.  After 5 seconds the request is timed out in the Scala gRPC
> server.  I can bump up the time out, but that's just a band aid (what
> happens if someone wants to profile 1+TB file, for example).  I think a
> reasonable fix is to allow the user to select any offset in the file and we
> profile up to X bytes from that offset, where X is perhaps something on the
> order of 1M.  This ensures the UI is responsive and can scale.
>
> We expect to have a release candidate of v1.3.1 within two weeks from now,
> and I'm hoping to address these scale issues before then.  Feedback
> welcome!
>
> Thank you.
>
