Re: Release Lucene/Solr 8.9.0 should we have it soon

2021-05-12 Thread Peter Gromov
I can try backporting Hunspell improvements (
https://issues.apache.org/jira/browse/LUCENE-9687) to 8.x, unless you think
it's too large a change.

On Tue, May 11, 2021 at 7:29 PM Mayya Sharipova
 wrote:

> Thanks everyone,
>
> Adrien, I am happy to try to be the release manager for this release.
>
> Adrien, and Gus, please let me know when your changes are merged to 8.x
>
>
>
> On Tue, May 11, 2021 at 10:38 AM Gus Heck  wrote:
>
>> I'm also looking to find time to get
>> https://issues.apache.org/jira/browse/SOLR-14597 into some sort of 8.x
>> release. I've recently completed the backport of two of the three related
>> Lucene tickets, and hope to work on the third tomorrow.
>>
>> I had some feedback there, but I think folks were waiting for the version
>> integrated with the final form of the Lucene tickets before delving
>> further. Hopefully this week I can start on a patch that does that.
>>
>> On Tue, May 11, 2021 at 10:25 AM Adrien Grand  wrote:
>>
>>> I would like to backport LUCENE-9827
>>> (https://issues.apache.org/jira/browse/LUCENE-9827), which addresses a
>>> performance regression in stored fields merges, before we release 8.9.
>>> I'll work on this as soon as possible.
>>>
>>> On Thu, May 6, 2021 at 10:28 PM Adrien Grand  wrote:
>>>
 +1

 Mayya, are you volunteering to be the release manager?

 On Thu, May 6, 2021 at 6:06 PM Ishan Chattopadhyaya <
 ichattopadhy...@gmail.com> wrote:

> +1
>
> On Thu, May 6, 2021 at 7:50 PM Mayya Sharipova
>  wrote:
>
>> Hello everyone,
>> I was wondering if we can have an 8.9.0 release. It has been more than
>> 3 months since 8.8.0 was released.
>> 8.9.0 doesn't need to be the last release in the 8.x series.
>>
>> Thanks.
>>
>
>>>
>>> --
>>> Adrien
>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>


Re: Welcome Peter Gromov as Lucene committer

2021-04-07 Thread Peter Gromov
Thanks, that helped!

On Wed, Apr 7, 2021 at 11:23 AM Dawid Weiss  wrote:

> See here -
> https://git.apache.org/setup/
>
> On Wed, Apr 7, 2021 at 10:52 AM Peter Gromov
>  wrote:
> >
> > I have 2FA, but nobody added me to the ASF organization. How does one do
> that?
> >
> > On Wed, Apr 7, 2021 at 10:09 AM Atri Sharma  wrote:
> >>
> >> Did you set your 2FA up and get added to the Apache Software
> >> Foundation GitHub org?
> >>
> >> On Wed, Apr 7, 2021 at 12:42 PM Peter Gromov
> >>  wrote:
> >> >
> >> > Thanks for the honor!
> >> >
> >> > (BTW I'm still not recognized by GitHub as having write access, and
> can't merge my pull requests :))
> >> >
> >> > > Peter, the tradition is that new committers introduce themselves
> with a brief bio.
> >> >
> >> > Okay, time for some bragging :) I've been working at JetBrains for
> some 17 years, most of them on the IntelliJ platform, mainly supporting various
> languages and their infrastructure, analyzing snapshots and improving
> performance. Aiming to catch more bugs before they hit production, I've
> introduced property-based testing to IntelliJ by creating a small library
> called jetCheck. Recently I've switched to the Grazie project and now I do
> some rule-based computational linguistics there and enhance the IDE support
> for English. As Grazie needs LanguageTool and Hunspell, I've also spent
> some time rewriting the latter in Java (here in Lucene), and optimizing
> them both. In my free time, I like mountain hiking (Munich/Germany is a
> great location for that!), and some amateur piano/harmonica playing/singing.
> >>
> >> --
> >> Regards,
> >>
> >> Atri
> >> Apache Concerted
> >>
>
>
>


Re: Welcome Peter Gromov as Lucene committer

2021-04-07 Thread Peter Gromov
I have 2FA, but nobody added me to the ASF organization. How does one do
that?

On Wed, Apr 7, 2021 at 10:09 AM Atri Sharma  wrote:

> Did you set your 2FA up and get added to the Apache Software
> Foundation GitHub org?
>
> On Wed, Apr 7, 2021 at 12:42 PM Peter Gromov
>  wrote:
> >
> > Thanks for the honor!
> >
> > (BTW I'm still not recognized by GitHub as having write access, and
> can't merge my pull requests :))
> >
> > > Peter, the tradition is that new committers introduce themselves with
> a brief bio.
> >
> > Okay, time for some bragging :) I've been working at JetBrains for some
> 17 years, most of them on the IntelliJ platform, mainly supporting various
> languages and their infrastructure, analyzing snapshots and improving
> performance. Aiming to catch more bugs before they hit production, I've
> introduced property-based testing to IntelliJ by creating a small library
> called jetCheck. Recently I've switched to the Grazie project and now I do
> some rule-based computational linguistics there and enhance the IDE support
> for English. As Grazie needs LanguageTool and Hunspell, I've also spent
> some time rewriting the latter in Java (here in Lucene), and optimizing
> them both. In my free time, I like mountain hiking (Munich/Germany is a
> great location for that!), and some amateur piano/harmonica playing/singing.
>
> --
> Regards,
>
> Atri
> Apache Concerted
>
>
>


Re: Welcome Peter Gromov as Lucene committer

2021-04-07 Thread Peter Gromov
Thanks for the honor!

(BTW I'm still not recognized by GitHub as having write access, and can't
merge my pull requests :))

> Peter, the tradition is that new committers introduce themselves with a
brief bio.

Okay, time for some bragging :) I've been working at JetBrains for some 17
years, most of them on the IntelliJ platform, mainly supporting various
languages and their infrastructure, analyzing snapshots and improving
performance. Aiming to catch more bugs before they hit production, I've
introduced property-based testing to IntelliJ by creating a small library
called jetCheck. Recently I've switched to the Grazie project and now I do
some rule-based computational linguistics there and enhance the IDE support
for English. As Grazie needs LanguageTool and Hunspell, I've also spent
some time rewriting the latter in Java (here in Lucene), and optimizing
them both. In my free time, I like mountain hiking (Munich/Germany is a
great location for that!), and some amateur piano/harmonica
playing/singing.



Re: Hunspell performance

2021-02-12 Thread Peter Gromov
Dawid, I didn't notice the commit link then, thanks for pointing that out!

This "// TODO: make sure these returned charsref are immutable?" is a good
point, because now they're very mutable, referring to internal preallocated
buffers in Stemmer which are constantly reused.

In the cache-all condition, you ignore the maxSize intentionally, right?

I've reproduced your results for English. I also checked German and French,
which have compounds and more advanced inflection. They're improved as
well, but not as much (30-40% with cache=10000, while calling native
Hunspell via JNI is 2-4 times faster).

I dream of making the Hunspell Stemmer thread-safe, and have even gotten
rid of some of the preallocated state there, but some still remains, so in
the near future it'll stay thread-unsafe, and the caching can fit in there.

Clients might deduplicate the requests themselves; I've done that a couple
of times. Then the cache inside Hunspell would be useless and just add some
overhead (luckily, not much, as per my CPU snapshots).

Robert, for n=20 the speedup is quite small, 2-8% for me depending on the
language. Unfortunately Hunspell dictionaries don't carry stop-word
information; it'd be quite useful.

On Fri, Feb 12, 2021 at 12:56 PM Robert Muir  wrote:

>
> On Fri, Feb 12, 2021 at 4:01 AM Dawid Weiss  wrote:
>
>>
>> It's all an intellectual exercise though. I consider the initial
>> runtime drop from small cache windows a nice, cheap win. Increasing
>> the cache leads to other problems that may sting later (GC activity,
>> memory consumption across multiple parallel threads, etc.).
>>
>>
> Is that because of stopwords that aren't being removed? I guess what I'm
> asking is, for this test is n=20 enough? :)
>
> If so, that leads to many other potential solutions (without caches). It
> would also suggest the benchmark might be a bit too biased.
>


Re: Hunspell performance

2021-02-11 Thread Peter Gromov
Dawid,

I like these numbers very much! Did you put the caching inside
Dictionary#lookupWord? Did you cache negative results as well?

Peter

On Thu, Feb 11, 2021 at 12:17 PM Dawid Weiss  wrote:

> That's an interesting idea about storing shortest (easy) or most frequent
>> (hard) words separately. Unfortunately the distribution isn't entirely
>> zipfian, as the dictionaries tend to contain a lot of short but uncommon
>> words, like abbreviations. Still, something for me to explore, thanks!
>>
>
> I meant taking advantage of the empirical distribution of words occurring
> in an average text (not their lengths) and providing a faster code path
> for these.
> I'd assume a typical run would spend the majority of the time looking up
> the same words over and over. A small cache could be added to the
> dictionary but also anywhere else - even on top of Stemmer or SpellChecker.
> This cache could be very, very simplistic - even a dumb throw-away HashMap
> discarded after it reaches a certain size (cache hits would occur only
> until it's filled up). The distribution of the input still will make it
> perform relatively well.
>
> Here's an example - I used Mike McCandless's enwiki and ran the English
> test from TestPerformance on it. My master branch results:
>
> Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
> Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945,
> 1955, 1959, 1972, 1952]
> Spellchecking en: average 2438.1428571428573, all times = [2465, 2425,
> 2429, 2434, 2452, 2429, 2433]
>
> I then added [1] a dumb no-eviction-policy caching on top of stemming and
> spell checking.
>
> Results for cache size of 2500 entries:
> Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889,
> 890, 886, 882]
> Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187,
> 1179, 1170, 1170]
>
> Results for cache size of 5000 entries:
> Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766,
> 763, 766, 767]
> Spellchecking en: average 1065.2857142857142, all times = [1070, 1068,
> 1061, 1065, 1064, 1067, 1062]
>
> Results for cache size of 10000 entries:
> Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
> Spellchecking en: average 926.8571428571429, all times = [924, 925, 930,
> 927, 925, 930, 927]
>
> The sizes of these caches are literally nothing, and they don't require any
> pretraining or eviction policies. This is all I meant by "hybrid" lookup
> based on the distribution... this could be built into the dictionary but
> doesn't have to be.
>
> D.
>
> [1]
> https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
>
>
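To make the quoted idea concrete, here is a minimal sketch of such a
throw-away cache (a paraphrase, not the commit linked above as [1]; the
wrapper class and the compute function are illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // A deliberately dumb bounded cache: entries are stored until maxSize is
    // reached; after that, results are still computed but no longer cached.
    // There is no eviction policy at all - the skewed distribution of words
    // in real text is what makes this perform well anyway.
    final class ThrowawayCache<K, V> {
      private final Map<K, V> map = new HashMap<>();
      private final int maxSize;
      private final Function<K, V> compute;

      ThrowawayCache(int maxSize, Function<K, V> compute) {
        this.maxSize = maxSize;
        this.compute = compute;
      }

      V get(K key) {
        V cached = map.get(key);
        if (cached != null) {
          return cached;
        }
        V value = compute.apply(key);
        if (map.size() < maxSize) { // stop growing once the cap is hit
          map.put(key, value);
        }
        return value;
      }
    }

Note that caching stemming results this way is only safe once the returned
CharsRef values are immutable - exactly the TODO noted earlier in the thread.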


Re: Hunspell performance

2021-02-11 Thread Peter Gromov
> To me the challenge with such a change is just trying to prevent
strange dictionaries from blowing up to 30x the space :)

Checked that. And indeed, such dictionaries exist, 20x and even 30x, and
then they start taking up to 30MB. Not nice.

> if you assume zipfian distribution of words, the top common ones could be
stored/cached outside of the FST (even in an associative dictionary). This
would require external frequency information during construction but this
isn't something difficult.

That's an interesting idea about storing shortest (easy) or most frequent
(hard) words separately. Unfortunately the distribution isn't entirely
zipfian, as the dictionaries tend to contain a lot of short but uncommon
words, like abbreviations. Still, something for me to explore, thanks!

As for configurability, I'm considering that option as well (but would
prefer to avoid it). It's not as easy as adding a facade around one
"lookup" method. Well, now it is, but if we're staying with FST, then I'd
better finish that arc caching optimization I described (some 20-30%
speedup, not bad), and that'd require changing multiple signatures to pass
around some arc cache info in addition to the simple char[].

On Thu, Feb 11, 2021 at 9:09 AM Dawid Weiss  wrote:

>
> I peeked at the code and I still think it's not a bad idea to experiment
> with extracting a facade for construction and lookup of words. There may
> even be a middle ground between size and speed - if you assume zipfian
> distribution of words, the top common ones could be stored/cached outside
> of the FST (even in an associative dictionary). This would require external
> frequency information during construction, but this isn't something
> difficult.
>
> D.
>
> On Thu, Feb 11, 2021 at 8:54 AM Dawid Weiss  wrote:
>
>>
>> I didn't mean for Peter to write both backends, but perhaps, if he's
>> experimenting already anyway, to make it possible to extract an interface
>> that could be substituted externally with different implementations. That
>> makes it easier to tinker with various options, even for us.
>>
>> D.
>>
>> On Thu, Feb 11, 2021 at 1:16 AM Robert Muir  wrote:
>>
>>> On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss 
>>> wrote:
>>> > Maybe the "backend" could be configurable somehow so that you could
>>> change the strategy depending on your needs?... I haven't looked at how
>>> FSTs are used but if can be hidden behind a facade then an alternative
>>> implementation could be provided depending on one's need?
>>> >
>>> > D.
>>> >
>>>
>>> I don't have any confidence that Solr would default to the "smaller"
>>> option or fix how they manage different Solr cores or thousands of
>>> threads or any of the analyzer issues. And who would maintain this
>>> separate Hunspell backend? I don't think it is fair to Peter to have
>>> to cope with 2 implementations of Hunspell; 1 is certainly enough...
>>> :). It's all Apache license; at the end of the day, if someone wants to
>>> step up, let 'em. Otherwise let's get out of their way.
>>>
>>>
>>>


Re: Hunspell performance

2021-02-10 Thread Peter Gromov
I was hoping for some numbers :) In the meantime, I've got some of my own.
I loaded 90 dictionaries from https://github.com/wooorm/dictionaries
(there are more, but I ignored dialects of the same base language). Together
they currently consume a humble 166MB. With one of my less memory-hungry
approaches, they'd take ~500MB (maybe less if I optimize, but probably not
significantly). Is this very bad, or tolerable for, say, a 50% speedup?

I've seen huge *.aff files, and I'm planning to do something with affix
FSTs, too. They take some noticeable time as well, but much less than the
*.dic ones, so for now I'm concentrating on *.dic.

> Sure, but 20% of those linear scans are maybe 7x slower

Checked that. The distribution appears to be decreasing monotonically. No
linear scans are longer than 8, and ~85% of all linear scans end after no
more than 1 miss.

I'll try BYTE1 if I manage to do it. It turned out to be surprisingly
complicated :(

On Wed, Feb 10, 2021 at 5:04 PM Robert Muir  wrote:

> Peter, looks like you are way ahead of me :) Thanks for all the work
> you have been doing here, and thanks to Dawid for helping!
>
> You probably know a lot of this code better than me at this point, but
> I remember a couple of these pain points, inline below:
>
> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>  wrote:
> >
> > Hi Robert,
> >
> > Yes, having multiple dictionaries in the same process would increase the
> memory significantly. Do you have any idea about how many of them people
> are loading, and how much memory they give to Lucene?
>
> Yeah, in many cases the user is using a server such as Solr or
> Elasticsearch.
> Let's use Solr as an example; others are here to correct me if I am
> wrong.
>
> Example to understand the challenges: a user uses one of Solr's three
> mechanisms to detect language and send documents to different pipelines:
>
> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
> Now, we know these language detectors are imperfect; if the user maps a
> lot of languages to Hunspell pipelines, they may load lots of
> dictionaries, even via just one stray miscategorized document.
> So it doesn't have to be some extreme "enterprise" use case like
> wikipedia.org; it can happen to a little guy faced with a
> multilingual corpus.
>
> Imagine the user decides to go further and host Solr search in this
> way for a couple of local businesses or government agencies.
> They support many languages and possibly use the detection scheme
> above to try to make language a "non-issue".
> The user may assign each customer a Solr "core" (separate index) with
> this configuration.
> Does each Solr core load its own HunspellStemFilterFactory? I think it
> might (in an isolated classloader); I could be wrong.
>
> For the Elasticsearch case, maybe the resource usage in the same scenario
> is lower, because they reuse dictionaries per node?
> I think this is how it works, but I honestly can't remember.
> Still, the problem remains: it's easy to end up with dozens of these
> things in memory.
>
> Also, we have the problem that memory usage for a specific language can
> blow up in several ways.
> Some languages have a bigger .aff file than .dic!
>
> > Thanks for the idea about root arcs. I've done some quick sampling and
> tracing (for German). 80% of root arc processing time is spent in direct
> addressing, and the remainder is linear scan (so root acrs don't seem to
> present major issues). For non-root arcs, ~50% is directly addressed, ~45%
> linearly-scanned, and the remainder binary-searched. Overall there's about
> 60% of direct addressing, both in time and invocation counts, which doesn't
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
> that might increase the number of directly addressed arcs, but I'm not sure
> that'd speed up much given that time and invocation counts seem to
> correlate.
> >
>
> Sure, but 20% of those linear scans are maybe 7x slower; it's
> O(log2(alphabet_size)), right (assuming alphabet size ~ 128)?
> Hard to reason about, but maybe worth testing out. It still helps for
> all the other segmenters (Japanese, Korean) using FSTs.
>
>
>


Re: Hunspell performance

2021-02-10 Thread Peter Gromov
>
> at the price of not being able to enumerate all of a node's outgoing arcs.
>

So FSTEnum isn't possible there? Too bad, I need it for suggestions.


Re: Hunspell performance

2021-02-10 Thread Peter Gromov
Hi Robert,

Yes, having multiple dictionaries in the same process would increase the
memory significantly. Do you have any idea about how many of them people
are loading, and how much memory they give to Lucene?

Yes, I've mentioned I've prototyped "using the FST in a smarter way" :)
Namely, it's possible to cache the arcs/outputs used when searching for
"electrification" and reuse most of them after an affix is stripped and
we're faced with "electrify". This allocates a bit more for each token,
but gives a noticeable speedup. I'm not entirely happy with the resulting
code complexity and performance, but I can create a PR.
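Roughly, a sketch of that arc caching could look like the following
(simplified: a real version would also have to cache the accumulated
outputs per prefix; the Lucene 8.x FST API and BMP-only labels are
assumed, and the class name is illustrative):

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.lucene.util.fst.FST;

    // Remembers the FST arc reached after each prefix of the previously
    // walked word, so a word sharing a prefix with it (e.g. "electrify"
    // after "electrification") resumes from the shared prefix instead of
    // re-traversing from the root.
    final class CachedFstWalker<T> {
      private final FST<T> fst;
      private final FST.BytesReader reader;
      private char[] lastWord = new char[0];
      private int validPrefix = 0; // how many cached arcs match lastWord
      @SuppressWarnings("unchecked")
      private FST.Arc<T>[] arcAfter = new FST.Arc[32]; // arcAfter[i]: arc after chars 0..i

      CachedFstWalker(FST<T> fst) {
        this.fst = fst;
        this.reader = fst.getBytesReader();
      }

      /** Returns the arc reached after consuming word[0..len), or null on a dead end. */
      FST.Arc<T> walk(char[] word, int len) throws IOException {
        int common = 0;
        while (common < len && common < validPrefix && word[common] == lastWord[common]) {
          common++;
        }
        lastWord = Arrays.copyOf(word, len);
        validPrefix = common;
        if (arcAfter.length < len) {
          arcAfter = Arrays.copyOf(arcAfter, len);
        }
        FST.Arc<T> arc = common == 0 ? fst.getFirstArc(new FST.Arc<>()) : arcAfter[common - 1];
        for (int i = common; i < len; i++) {
          // a fresh Arc per step: this is the "allocates a bit more" part
          FST.Arc<T> next = fst.findTargetArc(word[i], arc, new FST.Arc<>(), reader);
          if (next == null) {
            return null;
          }
          arcAfter[i] = next;
          validPrefix = i + 1;
          arc = next;
        }
        return arc;
      }
    }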

I'm talking only about plain old affix removal; I have no inexact matching.
Decomposition basically works like "try to break the word in various places
and stem the parts separately, looking at some additional flags". For the
first word part, some arcs/outputs could be reused from the initial
analysis, but for the next ones most likely not. And when I tried the
aforementioned reuse, the code became so unpleasant that I started looking
for alternatives :)
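As a toy illustration of that "break and stem separately" loop (isKnownStem
and minPart stand in for the real dictionary lookup and the compound
flag/length checks):

    import java.util.function.Predicate;

    final class CompoundSketch {
      // A word is accepted as a compound if some split point yields a known
      // stem on the left and a known stem (or a smaller compound) on the
      // right. Real Hunspell additionally checks compound flags on each
      // part, COMPOUNDMIN-style length limits, part counts, etc.
      static boolean isCompound(String word, Predicate<String> isKnownStem, int minPart) {
        for (int split = minPart; split <= word.length() - minPart; split++) {
          String left = word.substring(0, split);
          String right = word.substring(split);
          if (isKnownStem.test(left)
              && (isKnownStem.test(right) || isCompound(right, isKnownStem, minPart))) {
            return true;
          }
        }
        return false;
      }
    }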

One thing I don't like about the arc-caching approach is that it looks like
a dead end: the FST invocation count already seems close to minimal,
and yet the traversal is still very visible in CPU snapshots. And I see
no low-hanging fruit in the FST internals. The lookups just seem to require
reading/analyzing too many bytes, doing much more work than a typical
hashmap access :)

Thanks for the idea about root arcs. I've done some quick sampling and
tracing (for German). 80% of root arc processing time is spent in direct
addressing, and the remainder is linear scan (so root arcs don't seem to
present major issues). For non-root arcs, ~50% is directly addressed, ~45%
linearly-scanned, and the remainder binary-searched. Overall there's about
60% of direct addressing, both in time and invocation counts, which doesn't
seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
that might increase the number of directly addressed arcs, but I'm not sure
that'd speed up much given that time and invocation counts seem to
correlate.

Peter

On Wed, Feb 10, 2021 at 2:52 PM Robert Muir  wrote:

> Just throwing out another random idea: if you are doing a lot of FST
> traversals (e.g. for inexact matching or decomposition), you may end
> up "hammering" the root arcs of the FST heavily, depending on how the
> algorithm works. Because root arcs are "busy", they end up being
> O(logN) lookups in the FST and get slow. The Japanese and Korean analyzers
> do "decompounding" too, and have hacks that waste some RAM,
> ensuring the heavy root arc traversals are O(1):
>
> https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java
>
> Bruno did some FST improvements across the board here, but last time
> we checked, these hacks were still needed for segmentation use cases
> like this: see his benchmark here: https://s.apache.org/ffelc
>
> For example, maybe it makes sense to cache a few hundred nodes here in
> a similar way, depending on the dictionary's alphabet size, to accelerate
> segmentation; I don't know if it will help. Maybe also the current FST
> "INPUT_TYPE" is inappropriate, and it would work better as a BYTE1 FST
> rather than BYTE2 or BYTE4 or whatever it is using now. The current
> stemming doesn't put much pressure on this, so it isn't optimized.
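For illustration, the same trick adapted to a Hunspell word FST might look
like this sketch (the cached label range, the class name, and the Lucene
8.x FST calls are assumptions, modeled on the kuromoji code linked above):

    import java.io.IOException;
    import org.apache.lucene.util.fst.FST;

    // Caches the root arcs for a fixed label range so that the heavily-hit
    // first FST transition becomes an O(1) array lookup instead of a
    // scan/binary search, at the cost of a little RAM.
    final class RootArcCache<T> {
      private static final int FIRST_LABEL = 'a', LAST_LABEL = 'z';
      private final FST<T> fst;
      private final FST.Arc<T>[] cache;

      @SuppressWarnings("unchecked")
      RootArcCache(FST<T> fst) throws IOException {
        this.fst = fst;
        this.cache = new FST.Arc[LAST_LABEL - FIRST_LABEL + 1];
        FST.Arc<T> root = fst.getFirstArc(new FST.Arc<>());
        FST.Arc<T> scratch = new FST.Arc<>();
        FST.BytesReader reader = fst.getBytesReader();
        for (int label = FIRST_LABEL; label <= LAST_LABEL; label++) {
          if (fst.findTargetArc(label, root, scratch, reader) != null) {
            cache[label - FIRST_LABEL] = new FST.Arc<T>().copyFrom(scratch);
          }
        }
      }

      /** Like findTargetArc from the root, but O(1) for labels in the cached range. */
      FST.Arc<T> findRootArc(int label, FST.Arc<T> into, FST.BytesReader reader) throws IOException {
        if (label >= FIRST_LABEL && label <= LAST_LABEL) {
          FST.Arc<T> cached = cache[label - FIRST_LABEL];
          return cached == null ? null : into.copyFrom(cached);
        }
        FST.Arc<T> root = fst.getFirstArc(new FST.Arc<>());
        return fst.findTargetArc(label, root, into, reader);
      }
    }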
>
> On Wed, Feb 10, 2021 at 7:53 AM Robert Muir  wrote:
> >
> > The RAM usage used to be as bad as you describe; it blows up way worse
> > for languages other than German. There were many issues :)
> >
> > For Lucene, one common issue was that users wanted to have a lot of
> > these things in RAM: e.g. supporting many different languages on a
> > single server (multilingual data) and so forth.
> > Can we speed up your use case by using the FST in a smarter way? Why
> > are there so many traversals... is it the way it is doing inexact
> > matching? Decomposition?
> >
> > That was the trick done with stemming: the stemming was
> > accelerated with some side data structures. For example, the
> > "patternIndex" thing, which is a scary precomputed list of tableized
> > DFAs... it's wasting a "little" space with these tables to speed up the
> > hotspot for stemming. In that patternIndex example, some
> > assumptions/limits had to be set that hopefully no dictionary would ever
> > violate: that's all the "please report this to dev@lucene.apache.org"
> > checks in the code. Some tests were run against all the crazy OO
> > dictionaries out there to examine the memory usage when looking at
> > changes like this. Some o

Hunspell performance

2021-02-10 Thread Peter Gromov
Hi there,

I'm mostly done with supporting the major Hunspell features necessary for
most European languages (https://issues.apache.org/jira/browse/LUCENE-9687),
but of course I anticipate more minor fixes to come. Thanks to Dawid Weiss
for the thorough reviews and for promptly accepting my PRs so far!

Now I'd like to make this Hunspell implementation at least as fast as the
native Hunspell called via JNI, ideally faster. Currently it seems 1.5-3
times slower for me, depending on the language (I've checked en/de/fr so
far). I've profiled it, done some minor optimizations, and now it appears
that most of the time is taken by FST traversals. I've prototyped decreasing
the number of these traversals, and the execution time goes down noticeably
(e.g. 30%), but it's still not enough, and the code becomes complicated.

So I'm considering other data structures instead of FSTs (Hunspell/C++
itself doesn't bother with tries: it uses hash tables and linear searches
instead). The problem is, FST is very well space-optimized, and other data
structures consume more memory.

So my question is: what's the relative importance of speed and memory in
Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be OK
to use a CharArrayMap taking 20-25MB, but be much faster on lookup (45%
improvement in stemming)? Or, with a BytesRefHash plus an array I can make
it ~9MB, with almost the same speedup (but more complex code).
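For the BytesRefHash variant, the shape would be roughly as follows (a
sketch only; the int payload is an arbitrary stand-in for the per-word
dictionary data):

    import org.apache.lucene.util.ArrayUtil;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.BytesRefHash;

    public class BytesRefHashDemo {
      public static void main(String[] args) {
        // BytesRefHash interns the word bytes compactly; a parallel array,
        // indexed by the ord the hash assigns, holds each word's data.
        BytesRefHash words = new BytesRefHash();
        int[] entryData = new int[16];

        int ord = words.add(new BytesRef("electrify")); // >= 0 if newly added
        if (ord >= 0) {
          entryData = ArrayUtil.grow(entryData, ord + 1);
          entryData[ord] = 42; // hypothetical per-word payload
        }

        int found = words.find(new BytesRef("electrify"));
        System.out.println(found < 0 ? -1 : entryData[found]); // prints 42
      }
    }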

How much memory usage is acceptable at all?

Maybe there are other suitable data structures in Lucene core that I'm not
aware of? I basically need a Map that would be best queried with
char[]+offset+length keys (like CharArrayMap does).
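For reference, the kind of char[]-sliced lookup meant here, using Lucene's
existing CharArrayMap (the IntsRef value type is again an arbitrary
stand-in):

    import org.apache.lucene.analysis.CharArrayMap;
    import org.apache.lucene.util.IntsRef;

    public class CharArrayMapDemo {
      public static void main(String[] args) {
        // startSize 16, case-sensitive keys
        CharArrayMap<IntsRef> dict = new CharArrayMap<>(16, false);
        dict.put("electrify", new IntsRef(new int[] {42}, 0, 1)); // word -> entry ids

        // Query a slice of a token buffer directly - no String allocation:
        char[] buffer = "electrify".toCharArray();
        IntsRef hit = dict.get(buffer, 0, buffer.length);
        System.out.println(hit);
      }
    }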

Peter


Enhancing Hunspell support

2021-01-11 Thread Peter Gromov
Hi,

I'd like to contribute to the support of Hunspell in Lucene, specifically:
* support the flags necessary for English, German, French, Spanish and
Russian dictionaries, possibly more languages later
* provide a public API to check if a word is misspelled
* mirror Hunspell's suggestion algorithm in Lucene, probably in the
"src/suggest" module

For context: I work on natural language support for IntelliJ-based IDEs.
We'd like to use Hunspell dictionaries there, but interfacing with native
binaries proved to be slow and unreliable. So we'd prefer a JVM-only
reimplementation of the Hunspell spellchecker and suggester. Lucene's
Hunspell-related code currently seems closest to that goal, so we thought
we could enhance it further.

Is there anything non-obvious that I should know before diving into the
implementation?

The contribution will likely consist of many commits dedicated to specific
subtasks or small refactorings. Should I file separate JIRA issues for each
of them, or is a single big one (e.g. "Hunspell improvements") enough?

Peter Gromov