In message <[email protected]>, Cy Schubert writes:
In message <[email protected]>, Mark Millard writes:
[This just puts my prior reply's material into Cy's
adjusted resend of the original. The To/Cc should
be complete this time.]
On Apr 12, 2023, at 22:52, Cy Schubert <[email protected]> wrote:
In message <[email protected]>, Mark Millard writes:
From: Charlie Li <vishwin_at_freebsd.org> wrote on
Date: Wed, 12 Apr 2023 20:11:16 UTC :

Charlie Li wrote:
Mateusz Guzik wrote:
can you please test poudriere with
https://github.com/openzfs/zfs/pull/14739/files

After applying, on the md(4)-backed pool regardless of block_cloning,
the cy@ `cp -R` test reports no differing (ie corrupted) files. Will
report back on poudriere results (no block_cloning).
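
A copy-and-compare check of the kind described above can be sketched as follows. This is only an illustration, not the actual cy@ test: the function names are hypothetical, and it assumes the tree was already copied (for example with `cp -R`) and that only regular files matter.

```python
import os


def same_bytes(path_a, path_b, chunk=1 << 20):
    """Compare two files byte for byte, reading both in fixed-size chunks."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk)
            block_b = fb.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True


def differing_files(src, dst):
    """Return sorted relative paths of regular files under src whose
    counterpart under dst is missing or has different contents."""
    bad = []
    for root, _dirs, names in os.walk(src):
        for name in names:
            a = os.path.join(root, name)
            # Hypothetical simplification: symlinks and special files are skipped.
            if os.path.islink(a) or not os.path.isfile(a):
                continue
            rel = os.path.relpath(a, src)
            b = os.path.join(dst, rel)
            if not os.path.isfile(b) or not same_bytes(a, b):
                bad.append(rel)
    return sorted(bad)
```

An empty list from `differing_files(src, dst)` corresponds to "no differing files"; the contents are read directly rather than trusting size or timestamps, since the corruption discussed here can leave file sizes unchanged.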

As for poudriere, build failures are still rolling in. These are (and
have been) entirely random on every run. Some examples from this run:

lang/php81:
- post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development
${WRKSRC}/php.ini-production ${WRKDIR}/php.conf
${STAGEDIR}/${PREFIX}/etc
- consumers fail to build due to corrupted php.conf packaged

devel/ninja:
- phase: stage
- install -s -m 555
/wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja
/wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
- consumers fail to build due to corrupted bin/ninja packaged

devel/netsurf-buildsystem:
- phase: stage
- mkdir -p
/wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles
/wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig
Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
cp makefiles/$M
/wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/;
\
done
- graphics/libnsgif fails to build due to NUL characters in
Makefile.{clang,subdir}, causing nothing to link

Summary: I have problems building ports into packages
via poudriere-devel use despite being fully updated/patched
(as of when I started the experiment), never having enabled
block_cloning ( still using openzfs-2.1-freebsd ).

In other words, I can confirm other reports that have
been made.

The details follow.


[Written as I was working on setting up for the experiments
and then executing those experiments, adjusting as I went
along.]

I've run my own tests in a context that has never had the
zpool upgrade and that jumped from before the openzfs import to
after the existing commits for trying to fix openzfs on
FreeBSD. I report on the sequence of activities getting to
the point of testing as well.

By personal policy I keep my (non-temporary) pools compatible
with what the most recent ??.?-RELEASE supports, using
openzfs-2.1-freebsd for now. The pools involved below have
never had a zpool upgrade from where they started. (I've no
pools that have ever had a zpool upgrade.)

(Temporary pools are rare for me, such as this investigation.
But I'm not testing block_cloning or anything new this time.)

I'll note that I use zfs for bectl, not for redundancy. So
my evidence is more limited in that respect.

The activities were done on a HoneyComb (16 Cortex-A72 cores).
The system has and supports ECC RAM, 64 GiBytes of RAM are
present.

I started by duplicating my normal zfs environment to an
external USB3 NVMe drive and adjusting the host name and such
to produce the below. (Non-debug, although I do not strip
symbols.) :

# uname -apKU
FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90
main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023
root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082

I then did: git fetch, stash push ., merge --ff-only, stash apply . :
my normal procedure. I then also applied the patch from:

https://github.com/openzfs/zfs/pull/14739/files

Then I did: buildworld buildkernel, install them, and rebooted.

The result was:

# uname -apKU
FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91
main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023
root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400086 1400086

The later poudriere-devel based build of packages from ports is
based on:

# ~/fbsd-based-on-what-commit.sh -C /usr/ports
4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD) devel/freebsd-gcc12: Bump to 12.2.0.
Author: John Baldwin <[email protected]>
Commit: John Baldwin <[email protected]>
CommitDate: 2023-03-25 00:06:40 +0000
branch: main
merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
merge-base: CommitDate: 2023-03-25 00:06:40 +0000
n613214 (--first-parent --count for merge-base)

poudriere attempted to build 476 packages, starting
with pkg (in order to build the 56 that I explicitly
indicate that I want). It is my normal set of ports.
The form of building is biased to allowing a high
load average compared to the number of hardware
threads (same as cores here): each builder is allowed
to use the full count of hardware threads. The build
used USE_TMPFS="data" instead of the USE_TMPFS=all I
normally use on the build machine involved.

And it produced some random errors during the attempted
builds. A type of example that is easy to interpret
without further exploration is:

pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse
error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected W:(0-9A-Za-z)
0
da0p8 ONLINE 0 0 0

errors: No known data errors


===
Mark Millard
marklmi at yahoo.com


Let's try this again. Claws-mail didn't include the list address in the
header. Trying to reply, again, using exmh instead.


Did your pools suffer the EXDEV problem? The EXDEV also corrupted files.
As I reported, this was a jump from before the import
to as things are tonight (here). So: NO, unless the
existing code as of tonight still has the EXDEV problem!
Prior to this experiment I'd not progressed any media
beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.
I think, without sufficient investigation, we risk jumping to
conclusions. I've taken an extremely cautious approach, rolling back
snapshots (as much as possible, i.e. poudriere datasets) when EXDEV
corruption was encountered.
Again: nothing between main-n261544-cee09bda03c8-dirty and
main-n262122-2ef2c26f3f13-dirty was involved at any stage.

I did not roll back any snapshots in my MH mail directory. Rolling back
snapshots of my MH maildir would result in loss of email. I have to
live with that corruption. Corrupted files in my outgoing sent email
directory remain:

slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
53
slippy$

There are 53 corrupted files in my note log of 9913 emails. Those files
will never be fixed. They were corrupted by the EXDEV bug. Any new ZFS
or ZFS patches cannot retroactively remove the corruption from those
files.
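
For anyone wanting to run the same kind of scan without ugrep, a minimal sketch follows. It assumes the messages are stored as individual files under one directory; the function name is hypothetical. Unlike the pipeline above, it reports every file that contains at least one NUL byte, however many lines are affected.

```python
import os


def files_with_nul(top, chunk=1 << 20):
    """Return sorted paths of regular files under top containing a NUL byte."""
    hits = []
    for root, _dirs, names in os.walk(top):
        for name in names:
            path = os.path.join(root, name)
            # Skip symlinks and special files; only regular files are scanned.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                while True:
                    block = f.read(chunk)
                    if not block:
                        break
                    if b"\x00" in block:
                        hits.append(path)
                        break
    return sorted(hits)
```

`len(files_with_nul(some_mail_dir))` gives a count comparable in spirit to the 53 above. A NUL byte is only a hint of corruption in files that are expected to be plain text; it says nothing about corruption that inserted other garbage or removed bytes.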

But my poudriere files, because the snapshots were rolled back, were
"repaired" by the rolled back snapshots.

I'm not convinced that there is presently active corruption since
the problem has been fixed. I am convinced that whatever corruption
that was written at the time will remain forever or until those files
are deleted or replaced -- just like my email files written to disk at
the time.
My test results and procedure just do not fit your conclusion
that things are okay now if block_cloning is completely avoided.
Admitting I'm wrong: sending copies of my last reply to you back to myself,
again and again, three times, I've managed to reproduce the corruption you
are talking about.
This email itself was also corrupted. Below is what was sent. Good thing
multiple copies are saved by exmh.
From my previous email to you.

header. Trying to reply:::::::::, again, using exmh instead.
                       ^^^^^^^^^
Here it is, nine additional bytes of garbage. I've replaced the garbage
with colons because nulls mess up a lot of things, including cut&paste.

In another instance about 500 bytes were removed. I can reproduce the
corruption at will now.

The EXDEV patch is applied. Block_cloning is disabled.

Somehow nulls and other garbage are inserted in the middle of emails
after the ZFS upgrade.
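
When a known-good copy of a message is available (as with the multiple copies exmh keeps), the inserted or missing byte ranges can be located mechanically instead of by eye. The sketch below uses Python's standard `difflib`; the function name is hypothetical, and it assumes both copies fit comfortably in memory, since the matcher can be slow on large inputs.

```python
from difflib import SequenceMatcher


def byte_changes(sent, stored):
    """Return the non-equal opcodes between two byte strings.

    Each entry is (tag, sent_start, sent_end, stored_start, stored_end),
    where tag is 'insert', 'delete' or 'replace'.
    """
    # autojunk=False keeps frequently repeated bytes (such as NULs or spaces)
    # from being ignored as "junk" on longer inputs.
    matcher = SequenceMatcher(None, sent, stored, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]
```

Extra garbage in the stored copy shows up as an `insert` entry whose stored range covers the added bytes, and a stretch that went missing shows up as a `delete` entry over the sent range.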
Can you please try this patch:
Unfortunately I don't see how this can happen with block cloning disabled.