Thanks Robert and Dawid,
What you said sounds reasonable to me, so I'll keep the MP private (it's
not hard to write anyway, so people facing a similar situation can still
figure it out easily).
In our case we do have some other constraints that require us to "clean"
the segments every so often, so we still need to do that.
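
For anyone in a similar situation, the selection step could be sketched
roughly as below. This is plain Java that only models the decision "which
segments are old enough, and carry deletions, to need a single-segment
rewrite"; StaleSegmentSelector, SegmentAge and the parameter names are
hypothetical, not Lucene APIs. In Lucene the same check would presumably
live in a wrapper around the normal MP (for example a FilterMergePolicy
subclass) that adds one single-segment merge per selected segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: models the selection step, independent of Lucene. */
public class StaleSegmentSelector {

    /** Hypothetical per-segment metadata: name, creation time, deleted doc count. */
    public static final class SegmentAge {
        final String name;
        final long createdAtMillis;
        final int deletedDocs;

        public SegmentAge(String name, long createdAtMillis, int deletedDocs) {
            this.name = name;
            this.createdAtMillis = createdAtMillis;
            this.deletedDocs = deletedDocs;
        }
    }

    /**
     * Returns the names of segments that still hold deleted docs and whose age
     * has reached (maxAgeMillis - safetyMarginMillis). The margin leaves time
     * for the rewrite itself to finish before the deadline.
     */
    public static List<String> selectForRewrite(
            List<SegmentAge> segments,
            long nowMillis,
            long maxAgeMillis,
            long safetyMarginMillis) {
        // Any segment created at or before this instant is considered due.
        long cutoff = nowMillis - (maxAgeMillis - safetyMarginMillis);
        List<String> selected = new ArrayList<>();
        for (SegmentAge s : segments) {
            // A segment without deletions has nothing to purge, so skip it.
            if (s.deletedDocs > 0 && s.createdAtMillis <= cutoff) {
                selected.add(s.name);
            }
        }
        return selected;
    }
}
```

Using the segment creation time is conservative: a delete applied to a
segment is purged no later than when that segment reaches the maximum age,
which is never later than the same period counted from the delete itself.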

Anyway, thank you for the interpretation of GDPR. I'm actually not sure what
exactly it's trying to enforce, so this was a good lesson for me as well.

Patrick


On Tue, Nov 28, 2023 at 2:48 PM Robert Muir <rcm...@gmail.com> wrote:

> and if you delete those segments, will that data ever be actually
> removed from the underlying physical storage? equally uncertain.
>
> deleting a file from the filesystem is similar to what lucene is
> doing, it doesn't really delete anything from the disk, just allows it
> to be overwritten by future writes.
>
> so I don't think we should provide any "GDPRMergePolicy" to satisfy an
> extreme (and short-sighted) legal interpretation. it wouldn't solve
> the problem anyway.
>
> On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
> >
> > Are larger and older segments even certain to ever be merged in
> > practice? I was assuming that if there is not a lot of new indexed
> > content and not a lot of older documents being deleted, large older
> > segments might never have to be merged.
> >
> >
> > On Tue 28 Nov 2023 at 20:53, Robert Muir <rcm...@gmail.com> wrote:
> >>
> >> I don't think there's any problem with GDPR, and I don't think users
> >> should be running unnecessary "optimize". GDPR just says data should
> >> be erased without "undue" delay. waiting for a merge to nuke the
> >> deleted docs isn't "undue", there is a good reason for it.
> >>
> >> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai <zhai7...@gmail.com> wrote:
> >> >
> >> > Hi Folks,
> >> > At LinkedIn we need to comply with GDPR for a large part of our data,
> >> > and an important part of that is being sure we have completely deleted
> >> > the data a user asked us to delete within a certain period of time.
> >> > The approach we have come up with so far is to:
> >> > 1. Record the segment creation time somewhere (not decided yet, maybe
> >> > index commit userinfo, maybe some other place outside of Lucene)
> >> > 2. Create a new merge policy which delegates most operations to a
> >> > normal MP, like TieredMergePolicy, and then adds extra single-segment
> >> > merges (merging 1 segment into 1 segment, basically only purging
> >> > deletions) if it finds any segment that is about to violate the GDPR
> >> > time frame.
> >> >
> >> > So here are my questions:
> >> > 1. Is there a better/existing way to do this?
> >> > 2. I would like to contribute such a merge policy directly to Lucene,
> >> > since I think GDPR is more or less a common concern. Do people feel
> >> > it's necessary or not?
> >> > 3. It would also be nice if IndexWriter could store the segment
> >> > creation time in the index directly (maybe write it to SegmentInfo?).
> >> > I can try to do that, but would like to ask whether there are any
> >> > objections?
> >> >
> >> > Best
> >> > Patrick
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
