Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

sebb Sun, 14 Nov 2021 15:00:47 -0800

On Sun, 14 Nov 2021 at 17:32, sebb <[email protected]> wrote:
>
> On Sun, 14 Nov 2021 at 16:22, Daniel Gruno <[email protected]> wrote:
> >
> > On 14/11/2021 12.19, sebb wrote:
> > > -1
> > >
> > > I think this commit should be reverted.
> > >
> > > I see no purpose for the body_short header.
> >
> > It is absolutely needed. It is what makes it possible to search large
> > lists with very large emails and not have everything crash around you.
>
> I see now.
>
> I was confused by the use of doc['body'] in messages.query and
> overlooked the fact that it does not download the body attribute.
>
> However, I'm not sure that BODY_MAXLEN serves any purpose.
>
> body_short could include the '...' marker.
>
> > When you only need the first bit of the text, there is no need to grab
> > several megabytes of email body, especially not if you have thousands of
> > emails you need to parse. In a production environment I have access to,
> > this caused a significant speedup and went from grabbing 2GB of data per
> > search to only 50MB - search time and overall backend response time was
> > an order of magnitude faster after this had been implemented. It is a
> > very vital component of a responsive service.
> >
> > The only other way would be to use scripting inside the ES query, which
> > is both dangerous and requires additional setup.
> >


On further reflection, I don't think the size of the body field is the
fundamental problem.

As I see it, the main problem is the number of emails that are
returned in a single response.

As the design is currently, even 200 bytes could become a problem if
there are enough emails in the search result.
And even if the server can cope, it can send a lot of data to the
client, which might be a low-memory phone.

At present the design just does not scale well, which is why there has
had to be this work-round.

I think the design needs to be reworked to use a batched response.
It would then not matter if the entire body were returned to
stats.lua, and that could shorten the data for display.

> > As for the length, 200 is the standard used throughout the interface.
> > Yes, you could change it, and yes you might need to reindex if so, but
> > who is realistically going to change it?
> >
>
> > >
> > > It wastes space in the database.
> > > It increases the data sent from database to server code.
> > >
> > > What if you want to change the length?
> > > Are you going to update the entire database each time the length is 
> > > changed?
> > >
> > > Seems to me the sensible way to do this is in the status.lua plugin.
> > > This can then pick up a config item for the length.
> > >
> > > Sebb
> > >
> > > On Sun, 17 Oct 2021 at 21:32, sebb <[email protected]> wrote:
> > >>
> > >> On Sun, 17 Oct 2021 at 16:43, <[email protected]> wrote:
> > >>>
> > >>> This is an automated email from the ASF dual-hosted git repository.
> > >>>
> > >>> humbedooh pushed a commit to branch master
> > >>> in repository 
> > >>> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> > >>>
> > >>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1
> > >>> Author: Daniel Gruno <[email protected]>
> > >>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200
> > >>>
> > >>>      Add a separate header for short bodies for stats.py
> > >>> ---
> > >>>   tools/archiver.py   | 5 ++++-
> > >>>   tools/mappings.yaml | 2 ++
> > >>>   2 files changed, 6 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/tools/archiver.py b/tools/archiver.py
> > >>> index 1783d0e..dc12e39 100755
> > >>> --- a/tools/archiver.py
> > >>> +++ b/tools/archiver.py
> > >>> @@ -584,6 +584,8 @@ class Archiver(object):  # N.B. Also used by 
> > >>> import-mbox.py
> > >>>               ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest()
> > >>>
> > >>>               notes.append(["ARCHIVE: Email archived as %s at %u" % 
> > >>> (document_id, time.time())])
> > >>> +            body_unflowed = body.unflow() if body else ""
> > >>> +            body_shortened = body_unflowed[:210]  # 210 so that we can 
> > >>> tell if > 200.
> > >>>
> > >>
> > >> -1
> > >>
> > >> What's so special about 200 and 210?
> > >>
> > >> These numbers should be constants (with suitable docn) or possibly 
> > >> configuration items.
> > >>
> > >> The only bare numbers I would expect to see in code are 0 and 1 (or -1).
> > >>
> > >>>               output_json = {
> > >>>                   "from_raw": msg_metadata["from"],
> > >>> @@ -603,7 +605,8 @@ class Archiver(object):  # N.B. Also used by 
> > >>> import-mbox.py
> > >>>                   "private": private,
> > >>>                   "references": msg_metadata["references"],
> > >>>                   "in-reply-to": irt,
> > >>> -                "body": body.unflow() if body else "",
> > >>> +                "body": body_unflowed,
> > >>> +                "body_short": body_shortened,
> > >>>                   "html_source_only": body and body.html_as_source or 
> > >>> False,
> > >>>                   "attachments": attachments,
> > >>>                   "forum": (lid or "").strip("<>").replace(".", "@", 1),
> > >>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml
> > >>> index 4bb4978..6ad72d3 100644
> > >>> --- a/tools/mappings.yaml
> > >>> +++ b/tools/mappings.yaml
> > >>> @@ -55,6 +55,8 @@ mbox:
> > >>>             type: long
> > >>>       body:
> > >>>         type: text
> > >>> +    body_short:
> > >>> +      type: text
> > >>>       cc:
> > >>>         type: text
> > >>>       date:
> >

Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

Reply via email to