Re: [PATCH 1/5] Renaming weak prng invocations - prandom_bytes_state, prandom_u32_state

2022-12-14 Thread Theodore Ts'o
On Wed, Dec 14, 2022 at 05:21:17PM +0100, Stanislaw Gruszka wrote:
> On Wed, Dec 14, 2022 at 04:15:49PM +0100, Eric Dumazet wrote:
> > On Wed, Dec 14, 2022 at 1:34 PM Stanislaw Gruszka
> >  wrote:
> > >
> > > On Mon, Dec 12, 2022 at 03:35:20PM +0100, Jason A. Donenfeld wrote:
> > > > Please CC me on future revisions.
> > > >
> > > > As of 6.2, the prandom namespace is *only* for predictable randomness.
> > > > There's no need to rename anything. So nack on this patch 1/5.
> > >
> > > It is not obvious (for casual developers like me) that p in prandom
> > > stands for predictable. Some renaming would be useful IMHO.

I disagree.  pseudo-random has *always* menat "predictable".  And the
'p' in prandom was originally "pseudo-random".  In userspace,
random(3) is also pseudo-random, and is ***utterly*** predictable.  So
the original use of prandom() was a bit more of an explicit nod to the
fact that prandom is something which is inherently predictable.

So I don't think it's needed to rename it, whether it's to
"predictable_rng_prandom_u32", or 
"no_you_idiot_dont_you_dare_use_it_for_cryptographi_purposes_prandom_u32".

I think we need to assume a certain base level of competence,
especially for someone who is messing with security psensitive kernel
code.  If a developer doesn't know that a prng is predictable, that's
probably the *least* of the sort of mistakes that they might make.

- Ted


Re: [QUESTION] {start,stop}_this_handle() and lockdep annotations

2022-11-04 Thread Theodore Ts'o
Note: in the future, I'd recommend looking at the MAINTAINERS to
figure out a smaller list of people to ask this question, instead of
spamming everyone who has ever expressed interest in DEPT.


On Fri, Nov 04, 2022 at 02:56:32PM +0900, Byungchul Park wrote:
> Peterz (commit 34a3d1e8370870 lockdep: annotate journal_start()) and
> the commit message quoted what Andrew Morton said. It was like:
> 
> > Except lockdep doesn't know about journal_start(), which has ranking
> > requirements similar to a semaphore.
> 
> Could anyone tell what the ranking requirements in the journal code
> exactly means and what makes {start,stop}_this_handle() work for the
> requirements?

The comment from include/linux/jbd2.h may be helpful:

#ifdef CONFIG_DEBUG_LOCK_ALLOC
/**
 * @j_trans_commit_map:
 *
 * Lockdep entity to track transaction commit dependencies. Handles
 * hold this "lock" for read, when we wait for commit, we acquire the
 * "lock" for writing. This matches the properties of jbd2 journalling
 * where the running transaction has to wait for all handles to be
 * dropped to commit that transaction and also acquiring a handle may
 * require transaction commit to finish.
 */
struct lockdep_map  j_trans_commit_map;
#endif

And the reason why this isn't a problem is because start_this_handle()
can be passed a special handle which is guaranteed to not block
(because we've reserved journal credits for it).  Hence, there is no
risk that in _this_ call path start_this_handle() will block for a
commit:

> <4>[   43.124442 ] stacktrace:
> <4>[   43.124443 ]   start_this_handle+0x557/0x620
> <4>[   43.124445 ]   jbd2_journal_start_reserved+0x4d/0x1b0
   ^^^
> <4>[   43.124448 ]   __ext4_journal_start_reserved+0x6d/0x190
> <4>[   43.124450 ]   ext4_convert_unwritten_io_end_vec+0x22/0xd0
> <4>[   43.124453 ]   ext4_end_io_rsv_work+0xe4/0x190
> <4>[   43.124455 ]   process_one_work+0x301/0x660
> <4>[   43.124458 ]   worker_thread+0x3a/0x3c0
> <4>[   43.124459 ]   kthread+0xef/0x120
> <4>[   43.124462 ]   ret_from_fork+0x22/0x30


The comment for this function from fs/jbd2/transaction.c:

/**
 * jbd2_journal_start_reserved() - start reserved handle
 * @handle: handle to start
 * @type: for handle statistics
 * @line_no: for handle statistics
 *
 * Start handle that has been previously reserved with jbd2_journal_reserve().
 * This attaches @handle to the running transaction (or creates one if there's
 * not transaction running). Unlike jbd2_journal_start() this function cannot
 * block on journal commit, checkpointing, or similar stuff. It can block on
 * memory allocation or frozen journal though.
 *
 * Return 0 on success, non-zero on error - handle is freed in that case.
 */

And this is why this will never be a problem in real life, or flagged
by Lockdep, since Lockdep is a dynamic checker.  The deadlock which
the static DEPT checker has imagined can never, ever, ever happen.

For more context, also from fs/jbd2/transaction.c:

/**
 * jbd2_journal_start() - Obtain a new handle.
 * @journal: Journal to start transaction on.
 * @nblocks: number of block buffer we might modify
 *
 * We make sure that the transaction can guarantee at least nblocks of
 * modified buffers in the log.  We block until the log can guarantee
 * that much space. Additionally, if rsv_blocks > 0, we also create another
 * handle with rsv_blocks reserved blocks in the journal. This handle is
 * stored in h_rsv_handle. It is not attached to any particular transaction
 * and thus doesn't block transaction commit. If the caller uses this reserved
 * handle, it has to set h_rsv_handle to NULL as otherwise jbd2_journal_stop()
 * on the parent handle will dispose the reserved one. Reserved handle has to
 * be converted to a normal handle using jbd2_journal_start_reserved() before
 * it can be used.
 *
 * Return a pointer to a newly allocated handle, or an ERR_PTR() value
 * on failure.
 */

To be clear, I don't blame DEPT for not being able to figure this out;
it would require human-level intelligence to understand why in *this*
call path, we never end up waiting.  But this is why I am very
skeptical of static analyzers which are *sure* that anything they flag
is definitely a bug.  We definitely will need a quick and easy way to
tell DEPT, "go home, you're drunk".

Hope this helps,

- Ted


P.S.  Note: the actual function names are a bit misleading.  It looks
like the functions got refactored, and the documentation wasn't
updated to match.  Sigh... fortuantely the concepts are accurate; it's
just that function names needs to be fixed up.  For example, creating
a reserved handle is no longer done via jbd2_journal_start(), but
rather jbd2__journal_start().  



Re: [REPORT] syscall reboot + umh + firmware fallback

2022-05-12 Thread Theodore Ts'o
On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> I have a question about this one. Yes, it would never been stuck thanks
> to timeout. However, IIUC, timeouts are not supposed to expire in normal
> cases. So I thought a timeout expiration means not a normal case so need
> to inform it in terms of dependency so as to prevent further expiraton.
> That's why I have been trying to track even timeout'ed APIs.

As I beleive I've already pointed out to you previously in ext4 and
ocfs2, the jbd2 timeout every five seconds happens **all** the time
while the file system is mounted.  Commits more frequently than five
seconds is the exception case, at least for desktops/laptop workloads.

We *don't* get to the timeout only when a userspace process calls
fsync(2), or if the journal was incorrectly sized by the system
administrator so that it's too small, and the workload has so many
file system mutations that we have to prematurely close the
transaction ahead of the 5 second timeout.

> Do you think DEPT shouldn't track timeout APIs? If I was wrong, I
> shouldn't track the timeout APIs any more.

DEPT tracking timeouts will cause false positives in at least some
cases.  At the very least, there needs to be an easy way to suppress
these false positives on a per wait/mutex/spinlock basis.

 - Ted


Re: [Freedreno] Adding CI results to the kernel tree was Re: [RFC v2] drm/msm: Add initial ci/ subdirectory

2022-05-11 Thread Theodore Ts'o
On Wed, May 11, 2022 at 06:33:32AM -0700, Rob Clark wrote:
> 
> And ofc we want the expectations to be in the kernel tree because
> there could be, for example, differences between -fixes and -next
> branches.  (Or even stable kernel branches if/when we get to the point
> of running CI on those.)

There are tradeoffs both ways, whether the patches are kept separate,
opr in the kernel tree.

In the file system world, when we discover a bug, very often a test
case is found to test the fix, and to protect us against regressions.
It has one other benefit; since the tests (xfstests) are kept separate
from the kernel, it's a useful way to identify when some patch didn't
get automatically backported to a LTS or distro kernel.  (For example,
because the patch didn't cherry-pick cleanly and the manual backport
process fell through the cracks.)

It does make things annoying when we have bugs that can not be safely
backported (which results in tests that fail on the LTS kernel without
kernel-version exclude files), and/or when the expectations change
between versions.  (Although to be honest, for us, the more common
annoyance is when some userspace package --- e.g., bash or coreutils
or util-linux --- changes their output, and we have to add filter
functions to accomodate expected output differences.)

- Ted


Re: [PATCH RFC v6 00/21] DEPT(Dependency Tracker)

2022-05-09 Thread Theodore Ts'o
On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> Yes, right. DEPT has never been optimized. It rather turns on
> CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets on
> because of porting issue. I have no choice but to rely on those to
> develop DEPT out of tree. Of course, that's what I don't like.

Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
overhead can explain only a tiny fraction of the slowdown.  Consider:
if time to first test (time to boot the kernel, setup the test
environment, figure out which tests to run, etc.) is 12 seconds w/o
LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING and 602 seconds with
DEPT, you can really only blame 37 seconds out of the 602 seconds of
DEPT on unnecessary PROVE_LOCKING overhead.

So let's assume we can get rid of all of the PROVE_LOCKING overhead.
We're still talking about 12 seconds for time-to-first test without
any lock debugging, versus ** 565 ** seconds for time-to-first test
with DEPT.  That's a factor of 47x for DEPT sans LOCKDEP overhead,
compared to a 4x overhead for PROVE_LOCKING.

> Plus, for now, I'm focusing on removing false positives. Once it's
> considered settled down, I will work on performance optimizaition. But
> it should still keep relying on Lockdep CONFIGs and adding additional
> overhead on it until DEPT can be developed in the tree.

Well, please take a look at the false positive which I reported.  I
suspect that in order to fix that particular false positive, we'll
either need to have a way to disable DEPT on waiting on all page/folio
dirty bits, or it will need to treat pages from different inodes
and/or address spaces as being entirely separate classes, instead of
collapsing all inode dirty bits, and all of various inode's mutexes
(such as ext4's i_data_sem) as being part of a single object class.

> DEPT is tracking way more objects than Lockdep so it's inevitable to be
> slower, but let me try to make it have the similar performance to
> Lockdep.

In order to eliminate some of these false positives, I suspect it's
going to increase the number of object classes that DEPT will need to
track even *more*.  At which point, the cost/benefit of DEPT may get
called into question, especially if all of the false positives can't
be suppressed.

- Ted


Re: [PATCH RFC v6 00/21] DEPT(Dependency Tracker)

2022-05-09 Thread Theodore Ts'o
Oh, one other problem with DEPT --- it's SLOW --- the overhead is
enormous.  Using kvm-xfstests[1] running "kvm-xfstests smoke", here
are some sample times:

LOCKDEP DEPT
Time to first test  49 seconds  602 seconds
ext4/0012 s 22 s
ext4/0032 s 8 s
ext4/0050 s 7 s
ext4/0201 s 8 s
ext4/02111 s17 s
ext4/0230 s 83 s
generic/001 4 s 76 s
generic/002 0 s 11 s
generic/003 10 s19 s

There are some large variations; in some cases, some xfstests take 10x
as much time or more to run.  In fact, when I first started the
kvm-xfstests run with DEPT, I thought something had hung and that
tests would never start.  (In fact, with gce-xfstests the default
watchdog "something has gone terribly wrong with the kexec" had fired,
and I didn't get any test results using gce-xfstests at all.  If DEPT
goes in without any optimizations, I'm going to have to adjust the
watchdogs timers for gce-xfstests.)

The bottom line is that at the moment, between the false positives,
and the significant overhead imposed by DEPT, I would suggest that if
DEPT ever does go in, that it should be possible to disable DEPT and
only use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
because DEPT is S - L - O - W.

[1] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

- Ted

P.S.  Darrick and I both have disabled using LOCKDEP by default
because it slows down ext4 -g auto testing by a factor 2, and xfs -g
auto testing by a factor of 3.  So the fact that DEPT is a factor of
2x to 10x or more slower than LOCKDEP when running various xfstests
tests should be a real concern.



Re: [PATCH RFC v6 00/21] DEPT(Dependency Tracker)

2022-05-09 Thread Theodore Ts'o
I tried DEPT-v6 applied against 5.18-rc5, and it reported the
following positive.

The reason why it's nonsense is that in context A's [W] wait:

[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928]   folio_wait_bit_common+0x2fa/0x460
[ 1538.546248]   __filemap_get_folio+0x3e4/0x420
[ 1538.546558]   pagecache_get_page+0x11/0x40
[ 1538.546852]   ext4_mb_init_group+0x80/0x2e0
[ 1538.547152]   ext4_mb_good_group_nolock+0x2a3/0x2d0

... we're reading the block allocation bitmap into the page cache.
This does not correspond to a real inode, and so we don't actually
take ei->i_data_sem in this on the psuedo-inode used.

In contast, context's B's [W] and [E]'s stack traces, the
folio_wait_bit is clearly associated with page which is mapped to a
real inode:

[ 1538.553656] [W] down_write(>i_data_sem:0):
[ 1538.553948] [] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485]   ext4_map_blocks+0x17b/0x680
[ 1538.554772]   mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122]   ext4_writepages+0x798/0x990
[ 1538.555409]   do_writepages+0xcf/0x1c0
[ 1538.555682]   __writeback_single_inode+0x58/0x3f0
[ 1538.556014]   writeback_sb_inodes+0x210/0x540
 ...

[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455]   ext4_bio_write_page+0x400/0x560
[ 1538.559765]   mpage_submit_page+0x5c/0x80
[ 1538.560051]   mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409]   mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764]   ext4_writepages+0x798/0x990
[ 1538.561057]   do_writepages+0xcf/0x1c0
[ 1538.561329]   __writeback_single_inode+0x58/0x3f0
...


In any case, this will ***never*** deadlock, and it's due to DEPT
fundamentally not understanding that waiting on different pages may be
due to inodes that come from completely different inodes, and so there
is zero possible chance this would never deadlock.

I suspect there will be similar false positives for tests (or
userspace) that uses copy_file_range(2) or send_file(2) system calls.

I've included the full DEPT log report below.

- Ted

generic/011 [20:11:16][ 1533.411773] run fstests generic/011 at 
2022-05-07 20:11:16
[ 1533.509603] DEPT_INFO_ONCE: Need to expand the ring buffer.
[ 1536.910044] DEPT_INFO_ONCE: Pool(wait) is empty.
[ 1538.533315] ===
[ 1538.533793] DEPT: Circular dependency has been detected.
[ 1538.534199] 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571 Not tainted
[ 1538.534645] ---
[ 1538.535035] summary
[ 1538.535177] ---
[ 1538.535567] *** DEADLOCK ***
[ 1538.535567] 
[ 1538.535854] context A
[ 1538.536008] [S] down_write(>i_data_sem:0)
[ 1538.536323] [W] folio_wait_bit_common(pglocked:0)
[ 1538.536655] [E] up_write(>i_data_sem:0)
[ 1538.536958] 
[ 1538.537063] context B
[ 1538.537216] [S] (unknown)(pglocked:0)
[ 1538.537480] [W] down_write(>i_data_sem:0)
[ 1538.537789] [E] folio_wake_bit(pglocked:0)
[ 1538.538082] 
[ 1538.538184] [S]: start of the event context
[ 1538.538460] [W]: the wait blocked
[ 1538.538680] [E]: the event not reachable
[ 1538.538939] ---
[ 1538.539327] context A's detail
[ 1538.539530] ---
[ 1538.539918] context A
[ 1538.540072] [S] down_write(>i_data_sem:0)
[ 1538.540382] [W] folio_wait_bit_common(pglocked:0)
[ 1538.540712] [E] up_write(>i_data_sem:0)
[ 1538.541015] 
[ 1538.541119] [S] down_write(>i_data_sem:0):
[ 1538.541410] [] ext4_map_blocks+0x17b/0x680
[ 1538.541782] stacktrace:
[ 1538.541946]   ext4_map_blocks+0x17b/0x680
[ 1538.542234]   ext4_getblk+0x5f/0x1f0
[ 1538.542493]   ext4_bread+0xc/0x70
[ 1538.542736]   ext4_append+0x48/0xf0
[ 1538.542991]   ext4_init_new_dir+0xc8/0x160
[ 1538.543284]   ext4_mkdir+0x19a/0x320
[ 1538.543542]   vfs_mkdir+0x83/0xe0
[ 1538.543788]   do_mkdirat+0x8c/0x130
[ 1538.544042]   __x64_sys_mkdir+0x29/0x30
[ 1538.544319]   do_syscall_64+0x40/0x90
[ 1538.544584]   entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.544949] 
[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928]   folio_wait_bit_common+0x2fa/0x460
[ 1538.546248]   __filemap_get_folio+0x3e4/0x420
[ 1538.546558]   pagecache_get_page+0x11/0x40
[ 1538.546852]   ext4_mb_init_group+0x80/0x2e0
[ 1538.547152]   ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.547496]   ext4_mb_regular_allocator+0x391/0x780
[ 1538.547840]   ext4_mb_new_blocks+0x44e/0x720
[ 

Re: [PATCH RFC v5 00/21] DEPT(Dependency Tracker)

2022-03-19 Thread Theodore Ts'o
On Fri, Mar 18, 2022 at 04:49:45PM +0900, Byungchul Park wrote:
> 
> I guess it was becasue of the commit b1fca27d384e8("kernel debug:
> support resetting WARN*_ONCE"). Your script seems to reset WARN*_ONCE
> repeatedly.

I wasn't aware this was being done, but your guess was correct.  The
WARN_ONCE state is getting cleared between each test in xfstests, with
the rationale (which IMO is quite reasonable) why this done in the
xfstests commit descrition:

commit c67ea2347454aebbe8eb6e825e9314d099b683da
Author: Lukas Czerner 
Date:   Wed Jul 15 13:42:19 2020 +0200

check: clear WARN_ONCE state before each test

clear WARN_ONCE state before each test to allow a potential problem
to be reported for each test

[Eryu: replace "/sys/kernel/debug" with $DEBUGFS_MNT ]

Signed-off-by: Lukas Czerner 
Reviewed-by: Zorro Lang 
Signed-off-by: Eryu Guan 

Cheers,

- Ted


Re: [PATCH RFC v5 00/21] DEPT(Dependency Tracker)

2022-03-16 Thread Theodore Ts'o
On Wed, Mar 16, 2022 at 11:26:12AM +0900, Byungchul Park wrote:
> I'm gonna re-add RFC for a while at Ted's request. But hard testing is
> needed to find false alarms for now that there's no false alarm with my
> system. I'm gonna look for other systems that might produce false
> alarms. And it'd be appreciated if you share it when you see any alarms
> with yours.

Is dept1.18_on_v5.17-rc7 roughly equivalent to the v5 version sent to
the list.  The commit date is March 16th, so I assume it was.  I tried
merging it with the ext4 dev branch, and tried enabling CONFIG_DEPT
and running xfstests.  The result was nearly test failing, because a 
DEPT warning.

I assume that this is due to some misconfiguration of DEPT on my part?
And I'm curious why DEPT_WARN_ONCE is apparently getting many, many
times?

[  760.990409] DEPT_WARN_ONCE: Pool(ecxt) is empty.
[  770.319656] DEPT_WARN_ONCE: Pool(ecxt) is empty.
[  772.460360] DEPT_WARN_ONCE: Pool(ecxt) is empty.
[  784.039676] DEPT_WARN_ONCE: Pool(ecxt) is empty.

(and this goes on over and over...)

Here's the full output of the DEPT warning from trying to run
generic/001.  There is a similar warning for generic/002, generic/003,
etc., for a total of 468 failures out of 495 tests run.

[  760.945068] run fstests generic/001 at 2022-03-16 08:16:53
[  760.985440] [ cut here ]
[  760.990409] DEPT_WARN_ONCE: Pool(ecxt) is empty.
[  760.995166] WARNING: CPU: 1 PID: 73369 at kernel/dependency/dept.c:297 
from_pool+0xc2/0x110
[  761.003915] CPU: 1 PID: 73369 Comm: bash Tainted: GW 
5.17.0-rc7-xfstests-00649-g5456f2312272 #520
[  761.014389] Hardware name: Google Google Compute Engine/Google Compute 
Engine, BIOS Google 01/01/2011
[  761.024363] RIP: 0010:from_pool+0xc2/0x110
[  761.028598] Code: 3d 32 62 96 01 00 75 c2 48 6b db 38 48 c7 c7 00 94 f1 ad 
48 89 04 24 c6 05 1a 62 96 01 01 48 8b b3 20 9a 2f ae e8 2f dd bf 00 <0f> 0b 48 
8b 04 24 eb 98 48 63 c2 48 0f af 86 28 9a 2f ae 48 03 86
[  761.048189] RSP: 0018:a7ce4425fd48 EFLAGS: 00010086
[  761.053617] RAX:  RBX: 00a8 RCX: 
[  761.060965] RDX: 0001 RSI: adfb95e0 RDI: 
[  761.068322] RBP: 001dc598 R08:  R09: a7ce4425fb90
[  761.075789] R10: fffe0aa0 R11: fffe0ae8 R12: 9768e07f0600
[  761.083063] R13:  R14: 0246 R15: 
[  761.090312] FS:  7fd4ecc4c740() GS:97699940() 
knlGS:
[  761.098623] CS:  0010 DS:  ES:  CR0: 80050033
[  761.104580] CR2: 563c61657eb0 CR3: 0001328fa001 CR4: 003706e0
[  761.111921] DR0:  DR1:  DR2: 
[  761.119171] DR3:  DR6: fffe0ff0 DR7: 0400
[  761.126617] Call Trace:
[  761.129175]  
[  761.131385]  add_ecxt+0x54/0x1c0
[  761.134736]  ? simple_attr_write+0x87/0x100
[  761.139063]  dept_event+0xaa/0x1d0
[  761.142687]  ? simple_attr_write+0x87/0x100
[  761.147089]  __mutex_unlock_slowpath+0x60/0x2d0
[  761.151866]  simple_attr_write+0x87/0x100
[  761.155997]  debugfs_attr_write+0x40/0x60
[  761.160124]  vfs_write+0xec/0x390
[  761.163557]  ksys_write+0x68/0xe0
[  761.167004]  do_syscall_64+0x43/0x90
[  761.170782]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  761.176204] RIP: 0033:0x7fd4ecd3df33
[  761.180010] Code: 8b 15 61 ef 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 
b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[  761.199551] RSP: 002b:7ffe772d4808 EFLAGS: 0246 ORIG_RAX: 
0001
[  761.207240] RAX: ffda RBX: 0002 RCX: 7fd4ecd3df33
[  761.214583] RDX: 0002 RSI: 563c61657eb0 RDI: 0001
[  761.221835] RBP: 563c61657eb0 R08: 000a R09: 0001
[  761.229537] R10: 563c61902240 R11: 0246 R12: 0002
[  761.237239] R13: 7fd4ece0e6a0 R14: 0002 R15: 7fd4ece0e8a0
[  761.245283]  
[  761.247586] ---[ end trace  ]---
[  761.761829] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota 
mode: none.
[  769.903489] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota 
mode: none.

Let me know what I should do in order to fix this DEPT_WARN_ONCE?

Thanks,

- Ted


Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-03-06 Thread Theodore Ts'o
On Sun, Mar 06, 2022 at 07:51:42PM +0900, Byungchul Park wrote:
> > 
> > Users of DEPT must not have to understand how DEPT works in order to
> 
> Users must not have to understand how Dept works for sure, and haters
> must not blame things based on what they guess wrong.

For the record, I don't hate DEPT.  I *fear* that DEPT will result in
my getting spammed with a huge number of false posiives once automated
testing systems like Syzkaller, zero-day test robot, etcs., get a hold
of it once it gets merged and start generating hundreds of automated
reports.

And when I tried to read the DEPT reports, and the DEPT documentation,
and I found that its explanation for why ext4 had a circular
dependency simply did not make sense.  If my struggles to understand
why DEPT was issuing a false positive is "guessing", then how do we
have discussions over how to make DEPT better?

> > called prepare-to-wait on more than one wait queue, how is DEPT going
> > to distinguish between your "morally correct" wkaeup source, and the
> > "rescue wakeup source"?
> 
> Sure, it should be done manually. I should do it on my own when that
> kind of issue arises.

The question here is how often will it need to be done, and how easy
will it be to "do it manually"?  Suppose we mark all of the DEPT false
positives before it gets merged?  How easy will it be able to suppress
future false positives in the future, as the kernel evolves?

Perhaps one method is to haved a way to take a particular wait queue,
or call to schedule(), or at the level of an entire kernel source
file, and opt it out from DEPT analysis?  That way, if DEPT gets
merged, and a maintainer starts getting spammed by bogus (or
incomprehensible) reports, there is a simople way they can annotate
their source code to prevent DEPT from analyzing code that it is
apparently not able to understand correctly.

That way we don't necessarily need to have a debate over how close to
zero percent false positives is necessary before DEPT can get merged.
And we avoid needing to force maintainers to prove that a DEPT report
is a false positive, which is from my experience hard to do, since
they get accused of being DEPT haters and not understanding DEPT.

 - Ted


Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-03-05 Thread Theodore Ts'o
On Sat, Mar 05, 2022 at 11:55:34PM +0900, Byungchul Park wrote:
> > that is why some of the DEPT reports were completely incomprehensible
> 
> It's because you are blinded to blame at it without understanding how
> Dept works at all. I will fix those that must be fixed. Don't worry.

Users of DEPT must not have to understand how DEPT works in order to
understand and use DEPT reports.  If you think I don't understand how
DEPT work, I'm going to gently suggest that this means DEPT reports
are clear enough, and/or DEPT documentation needs to be
*substantially* improved, or both --- and these needs to happen before
DEPT is ready to be merged.

> > So if DEPT is issuing lots of reports about apparently circular
> > dependencies, please try to be open to the thought that the fault is
> 
> No one was convinced that Dept doesn't have a fault. I think your
> worries are too much.

In that case, may I ask that you add back a RFC to the subject prefix
(e.g., [PATCH RFC -v5]?)  Or maybe even add the subject prefix NOT YET
READY?  I have seen cases when after a patch series get to PATCH -v22,
and then people assume that it *must* be ready, as opposed what it
really means, which is "the author is just persistently reposting and
rebasing the patch series over and over again".  It would be helpful
if you directly acknowledge, in each patch submission, that it is not
yet ready for prime time.

After all, right now, DEPT has generated two reports in ext4, both of
which were false positives, and both of which have required a lot of
maintainer times to prove to you that they were in fact false
positives.  So are we all agreed that DEPT is not ready for prime
time?

> No one argued that their code must be buggy, either. So I don't think
> you have to worry about what's never happened.

Well, you kept on insisting that ext4 must have a circular dependency,
and that depending on a "rescue wakeup" is bad programming practice,
but you'll reluctantly agree to make DEPT accept "rescue wakeups" if
that is the will of the developers.  My concern here is the
fundmaental concept of "rescue wakeups" is wrong; I don't see how you
can distinguish between a valid wakeup and one that you and DEPT is
going to somehow characterize as dodgy.

Consider: a process can first subscribe to multiple wait queues, and
arrange to be woken up by a timeout, and then call schedule() to go to
sleep.  So it is not waiting on a single wait channel, but potentially
*multiple* wakeup sources.  If you are going to prove that kernel is
not going to make forward progress, you need to prove that *all* ways
that process might not wake up aren't going to happen for some reason.

Just because one wakeup source seems to form a circular dependency
proves nothing, since another wakeup source might be the designed and
architected way that code makes forward progress.

You seem to be assuminng that one wakeup source is somehow the
"correct" one, and the other ways that process could be woken up is a
"rescue wakeup source" and you seem to believe that relying on a
"rescue wakeup source" is bad.  But in the case of a process which has
called prepare-to-wait on more than one wait queue, how is DEPT going
to distinguish between your "morally correct" wkaeup source, and the
"rescue wakeup source"?

> No doubt. I already think so. But it doesn't mean that I have to keep
> quiet without discussing to imporve Dept. I will keep improving Dept in
> a reasonable way.

Well, I don't want to be in a position of having to prove that every
single DEPT report in my code that doesn't make sense to me is
nonsense, or else DEPT will get merged.

So maybe we need to reverse the burden of proof.

Instead of just sending a DEPT report, and then asking the maintainers
to explain why it is a false positive, how about if *you* use the DEPT
report to examinie the subsystem code, and then explain plain English,
how you think this could trigger in real life, or cause a performance
problem in real life or perhaps provide a script or C reproducer that
triggers the supposed deadlock?

Yes, that means you will need to understand the code in question, but
hopefully the DEPT reports should be clear enough that someone who
isn't a deep expert in the code should be able to spot the bug.  If
not, and if only a few deep experts of code in question will be able
to decipher the DEPT report and figure out a fix, that's really not
ideal.

If DEPT can find a real bug and you can show that Lockdep wouldn't
have been able to find it, then that would be proof that it is making
a real contribution.  That's would be real benefit.  At the same time,
DEPT will hopefully be able to demonstrate a false positive rate which
is low enough that the benefits clearly outweight the costs.

At the moment, I believe the scoreboard for DEPT with respect to ext4
is zero real bugs found, and two false positives, both of which have
required significant rounds of e-mail before the subsystem maintainers
were able to prove to you that it 

Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-03-04 Thread Theodore Ts'o
On Fri, Mar 04, 2022 at 12:20:02PM +0900, Byungchul Park wrote:
> 
> I found a point that the two wait channels don't lead a deadlock in
> some cases thanks to Jan Kara. I will fix it so that Dept won't
> complain it.

I sent my last (admittedly cranky) message before you sent this.  I'm
glad you finally understood Jan's explanation.  I was trying to tell
you the same thing, but apparently I failed to communicate in a
sufficiently clear manner.  In any case, what Jan described is a
fundamental part of how wait queues work, and I'm kind of amazed that
you were able to implement DEPT without understanding it.  (But maybe
that is why some of the DEPT reports were completely incomprehensible
to me; I couldn't interpret why in the world DEPT was saying there was
a problem.)

In any case, the thing I would ask is a little humility.  We regularly
use lockdep, and we run a huge number of stress tests, throughout each
development cycle.

So if DEPT is issuing lots of reports about apparently circular
dependencies, please try to be open to the thought that the fault is
in DEPT, and don't try to argue with maintainers that their code MUST
be buggy --- but since you don't understand our code, and DEPT must be
theoretically perfect, that it is up to the Maintainers to prove to
you that their code is correct.

I am going to gently suggest that it is at least as likely, if not
more likely, that the failure is in DEPT or your understanding of what
how kernel wait channels and locking works.  After all, why would it
be that we haven't found these problems via our other QA practices?

Cheers,

- Ted


Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-03-04 Thread Theodore Ts'o
On Fri, Mar 04, 2022 at 09:42:37AM +0900, Byungchul Park wrote:
> 
> All contexts waiting for any of the events in the circular dependency
> chain will be definitely stuck if there is a circular dependency as I
> explained. So we need another wakeup source to break the circle. In
> ext4 code, you might have the wakeup source for breaking the circle.
> 
> What I agreed with is:
> 
>The case that 1) the circular dependency is unevitable 2) there are
>another wakeup source for breadking the circle and 3) the duration
>in sleep is short enough, should be acceptable.
> 
> Sounds good?

These dependencies are part of every single ext4 metadata update,
and if there were any unnecessary sleeps, this would be a major
performance gap, and this is a very well studied part of ext4.

There are some places where we sleep, sure.  In some case
start_this_handle() needs to wait for a commit to complete, and the
commit thread might need to sleep for I/O to complete.  But the moment
the thing that we're waiting for is complete, we wake up all of the
processes on the wait queue.  But in the case where we wait for I/O
complete, that wakeupis coming from the device driver, when it
receives the the I/O completion interrupt from the hard drive.  Is
that considered an "external source"?  Maybe DEPT doesn't recognize
that this is certain to happen just as day follows the night?  (Well,
maybe the I/O completion interrupt might not happen if the disk drive
bursts into flames --- but then, you've got bigger problems. :-)

In any case, if DEPT is going to report these "circular dependencies
as bugs that MUST be fixed", it's going to be pure noise and I will
ignore all DEPT reports, and will push back on having Lockdep replaced
by DEPT --- because Lockdep give us actionable reports, and if DEPT
can't tell the difference between a valid programming pattern and a
bug, then it's worse than useless.

Sounds good?

- Ted


Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-03-03 Thread Theodore Ts'o
On Thu, Mar 03, 2022 at 02:23:33PM +0900, Byungchul Park wrote:
> I totally agree with you. *They aren't really locks but it's just waits
> and wakeups.* That's exactly why I decided to develop Dept. Dept is not
> interested in locks unlike Lockdep, but fouces on waits and wakeup
> sources itself. I think you get Dept wrong a lot. Please ask me more if
> you have things you doubt about Dept.

So the question is this --- do you now understand why, even though
there is a circular dependency, nothing gets stalled in the
interactions between the two wait channels?

- Ted


Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-03-02 Thread Theodore Ts'o
On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote:
> 
> Unfortunately, it's neither perfect nor safe without another wakeup
> source - rescue wakeup source.
> 
>consumer   producer
> 
>   lock L
>   (too much work queued == true)
>   unlock L
>   --- preempted
>lock L
>unlock L
>do work
>lock L
>unlock L
>do work
>...
>(no work == true)
>sleep
>   --- scheduled in
>   sleep

That's not how things work in ext4.  It's **way** more complicated
than that.  We have multiple wait channels, one wake up the consumer
(read: the commit thread), and one which wakes up any processes
waiting for commit thread to have made forward progress.  We also have
two spin-lock protected sequence number, one which indicates the
current commited transaction #, and one indicating the transaction #
that needs to be committed.

On the commit thread, it will sleep on j_wait_commit, and when it is
woken up, it will check to see if there is work to be done
(j_commit_sequence != j_commit_request), and if so, do the work, and
then wake up processes waiting on the wait_queue j_wait_done_commit.
(Again, all of this uses the pattern, "prepare to wait", then check to
see if we should sleep, if we do need to sleep, unlock j_state_lock,
then sleep.   So this prevents any races leading to lost wakeups.

On the start_this_handle() thread, if we current transaction is too
full, we set j_commit_request to its transaction id to indicate that
we want the current transaction to be committed, and then we wake up
the j_wait_commit wait queue and then we enter a loop where do a
prepare_to_wait in j_wait_done_commit, check to see if
j_commit_sequence == the transaction id that we want to be completed,
and if it's not done yet, we unlock the j_state_lock spinlock, and go
to sleep.  Again, because of the prepare_to_wait, there is no chance
of a lost wakeup.

So there really is no "consumer" and "producer" here.  If you really
insist on using this model, which really doesn't apply, for one
thread, it's the consumer with respect to one wait queue, and the
producer with respect to the *other* wait queue.  For the other
thread, the consumer and producer roles are reversed.

And of course, this is a highly simplified model, since we also have a
wait queue used by the commit thread to wait for the number of active
handles on a particular transaction to go to zero, and
stop_this_handle() will wake up commit thread via this wait queue when
the last active handle on a particular transaction is retired.  (And
yes, that parameter is also protected by a different spin lock which
is per-transaction).

So it seems to me that a fundamental flaw in DEPT's model is assuming
that the only waiting paradigm that can be used is consumer/producer,
and that's simply not true.  The fact that you use the term "lock" is
also going to lead a misleading line of reasoning, because properly
speaking, they aren't really locks.  We are simply using wait channels
to wake up processes as necessary, and then they will check other
variables to decide whether or not they need to sleep or not, and we
have an invariant that when these variables change indicating forward
progress, the associated wait channel will be woken up.

Cheers,

- Ted


P.S.  This model is also highly simplified since there are other
reasons why the commit thread can be woken up, some which might be via
a timeout, and some which is via the j_wait_commit wait channel but
not because j_commit_request has been changed, but because file system
is being unmounted, or the file system is being frozen in preparation
of a snapshot, etc.  These are *not* necessary to prevent a deadlock,
because under normal circumstances the two wake channels are
sufficient of themselves.  So please don't think of them as "rescue
wakeup sources"; again, that's highly misleading and the wrong way to
think of them.

And to make things even more complicated, we have more than 2 wait
channel --- we have *five*:

/**
 * @j_wait_transaction_locked:
 *
 * Wait queue for waiting for a locked transaction to start committing,
 * or for a barrier lock to be released.
 */
wait_queue_head_t   j_wait_transaction_locked;

/**
 * @j_wait_done_commit: Wait queue for waiting for commit to complete.
 */
wait_queue_head_t   j_wait_done_commit;

/**
 * @j_wait_commit: Wait queue to trigger commit.
 */
wait_queue_head_t   j_wait_commit;

/**
 * @j_wait_updates: Wait queue to wait for updates to complete.
 */
wait_queue_head_t   j_wait_updates;

/**
 * @j_wait_reserved:
 *
 * Wait queue to wait 

Re: Report 2 in ext4 and journal based on v5.17-rc1

2022-02-28 Thread Theodore Ts'o
On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > case 1. Code with an actual circular dependency, but not deadlock.
> > 
> >A circular dependency can be broken by a rescue wakeup source e.g.
> >timeout. It's not a deadlock. If it's okay that the contexts
> >participating in the circular dependency and others waiting for the
> >events in the circle are stuck until it gets broken. Otherwise, say,
> >if it's not meant, then it's anyway problematic.
> > 
> >1-1. What if we judge this code is problematic?
> >1-2. What if we judge this code is good?
> > 
> > I've been wondering if the kernel guys esp. Linus considers code with
> > any circular dependency is problematic or not, even if it won't lead to
> > a deadlock, say, case 1. Even though I designed Dept based on what I
> > believe is right, of course, I'm willing to change the design according
> > to the majority opinion.
> > 
> > However, I would never allow case 1 if I were the owner of the kernel
> > for better stability, even though the code works anyway okay for now.

Note, I used the example of the timeout as the most obvious way of
explaining that a deadlock is not possible.  There is also the much
more complex explanation which Jan was trying to give, which is what
leads to the circular dependency.  It can happen that when trying to
start a handle, if either (a) there is not enough space in the journal
for new handles, or (b) the current transaction is so large that if we
don't close the transaction and start a new hone, we will end up
running out of space in the future, and so in that case,
start_this_handle() will block starting any more handles, and then
wake up the commit thread.  The commit thread then waits for the
currently running threads to complete, before it allows new handles to
start, and then it will complete the commit.  In the case of (a) we
then need to do a journal checkpoint, which is more work to release
space in the journal, and only then, can we allow new handles to start.

The botom line is (a) it works, (b) there aren't significant delays,
and for DEPT to complain that this is somehow wrong and we need to
completely rearchitect perfectly working code because it doesn't
confirm to DEPT's idea of what is "correct" is not acceptable.

> We have a queue of work to do Q protected by lock L. Consumer process has
> code like:
> 
> while (1) {
>   lock L
>   prepare_to_wait(work_queued);
>   if (no work) {
>   unlock L
>   sleep
>   } else {
>   unlock L
>   do work
>   wake_up(work_done)
>   }
> }
> 
> AFAIU Dept will create dependency here that 'wakeup work_done' is after
> 'wait for work_queued'. Producer has code like:
> 
> while (1) {
>   lock L
>   prepare_to_wait(work_done)
>   if (too much work queued) {
>   unlock L
>   sleep
>   } else {
>   queue work
>   unlock L
>   wake_up(work_queued)
>   }
> }
> 
> And Dept will create dependency here that 'wakeup work_queued' is after
> 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> despite the code being perfectly valid and safe.

Cheers,

- Ted


Re: [PATCH 00/16] DEPT(Dependency Tracker)

2022-02-17 Thread Theodore Ts'o
On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
> 
> I personally believe that there's potential that this can be helpful and we
> will want to merge it.
> 
> But, what I believe Ted is trying to say is, if you do not know if the
> report is a bug or not, please do not ask the maintainers to determine it
> for you. This is a good opportunity for you to look to see why your tool
> reported an issue, and learn that subsystem. Look at if this is really a
> bug or not, and investigate why.

I agree there's potential here, or I would have ignored the ext4 "bug
report".

When we can get rid of the false positives, I think it should be
merged; I'd just rather it not be merged until after the false
positives are fixed, since otherwise, someone well-meaning will start
using it with Syzkaller, and noise that maintainers need to deal with
(with people requesting reverts of two year old commits, etc) will
increase by a factor of ten or more.  (With Syzbot reproducers that
set up random cgroups, IP tunnels with wiregaurd enabled, FUSE stress
testers, etc., that file system maintainers will be asked to try to
disentangle.)

So from a maintainer's perspective, false positives are highly
negative.  It may be that from some people's POV, one bug found and 20
false positive might still be "useful".  But if your tool gains a
reputation of not valuing maintainers' time, it's just going to make
us (or at least me :-) cranky, and it's going to be very hard to
recover from perception.  So it's probably better to be very
conservative and careful in polishing it before asking for it to be
merged.

Cheers,

- Ted



Re: [PATCH 00/16] DEPT(Dependency Tracker)

2022-02-17 Thread Theodore Ts'o
On Thu, Feb 17, 2022 at 07:57:36PM +0900, Byungchul Park wrote:
> 
> I've got several reports from the tool. Some of them look like false
> alarms and some others look like real deadlock possibility. Because of
> my unfamiliarity of the domain, it's hard to confirm if it's a real one.
> Let me add the reports on this email thread.

The problem is we have so many potentially invalid, or
so-rare-as-to-be-not-worth-the-time-to-investigate-in-the-
grand-scheme-of-all-of-the-fires-burning-on-maintainers laps that it's
really not reasonable to ask maintainers to determine whether
something is a false alarm or not.  If I want more of these unreliable
potential bug reports to investigate, there is a huge backlog in
Syzkaller.  :-)

Looking at the second ext4 report, it doesn't make any sense.  Context
A is the kjournald thread.  We don't do a commit until (a) the timeout
expires, or (b) someone explicitly requests that a commit happen
waking up j_wait_commit.  I'm guessing that complaint here is that
DEPT thinks nothing is explicitly requesting a wake up.  But note that
after 5 seconds (or whatever journal->j_commit_interval) is configured
to be we *will* always start a commit.  So ergo, there can't be a deadlock.

At a higher level of discussion, it's an unfair tax on maintainer's
times to ask maintainers to help you debug DEPT for you.  Tools like
Syzkaller and DEPT are useful insofar as they save us time in making
our subsystems better.  But until you can prove that it's not going to
be a massive denial of service attack on maintainer's time, at the
*very* least keep an RFC on the patch, or add massive warnings that
more often than not DEPT is going to be sending maintainers on a wild
goose chase.

If you know there there "appear to be false positives", you need to
make sure you've tracked them all down before trying to ask that this
be merged.

You may also want to add some documentation about why we should trust
this; in particular for wait channels, when a process calls schedule()
there may be multiple reasons why the thread will wake up --- in the
worst case, such as in the select(2) or epoll(2) system call, there
may be literally thousands of reasons (one for every file desriptor
the select is waiting on) --- why the process will wake up and thus
resolve the potential "deadlock" that DEPT is worrying about.  How is
DEPT going to handle those cases?  If the answer is that things need
to be tagged, then at least disclose potential reasons why DEPT might
be untrustworthy to save your reviewers time.

I know that you're trying to help us, but this tool needs to be far
better than Lockdep before we should think about merging it.  Even if
it finds 5% more potential deadlocks, if it creates 95% more false
positive reports --- and the ones it finds are crazy things that
rarely actually happen in practice, are the costs worth the benefits?
And who is bearing the costs, and who is receiving the benefits?

Regards,

- Ted


Re: [PATCH V2 00/13] use time_is_xxx() instead of jiffies judgment

2022-02-12 Thread Theodore Ts'o
On Thu, Feb 10, 2022 at 06:30:23PM -0800, Qing Wang wrote:
> From: Wang Qing 
> 
> It is better to use time_is_xxx() directly instead of jiffies judgment
> for understanding.

Hi Wang,

"judgement" doesn't really make sense as a description to an English
speaker.  The following a commit desription (for all of these series)
is probably going to be a bit more understable:

Use the helper function time_is_{before,after}_jiffies() to improve
code readability.

Cheers,

- Ted


Re: [PATCH v1 01/14] ext4/xfs: add page refcount helper

2021-08-25 Thread Theodore Ts'o
On Tue, Aug 24, 2021 at 10:48:15PM -0500, Alex Sierra wrote:
> From: Ralph Campbell 
> 
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> Reviewed-by: Christoph Hellwig 

Acked-by: Theodore Ts'o 


Re: [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_*

2021-06-20 Thread Theodore Ts'o
On Thu, Jun 17, 2021 at 10:16:57AM -0500, Alex Sierra wrote:
> v1:
> AMD is building a system architecture for the Frontier supercomputer with a
> coherent interconnect between CPUs and GPUs. This hardware architecture allows
> the CPUs to coherently access GPU device memory. We have hardware in our labs
> and we are working with our partner HPE on the BIOS, firmware and software
> for delivery to the DOE.
> 
> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
> (special purpose memory) in the UEFI system address map. The amdgpu driver 
> looks
> it up with lookup_resource and registers it with devmap as 
> MEMORY_DEVICE_GENERIC
> using devm_memremap_pages.
> 
> Now we're trying to migrate data to and from that memory using the 
> migrate_vma_*
> helpers so we can support page-based migration in our unified memory 
> allocations,
> while also supporting CPU access to those pages.
> 
> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages 
> behave
> correctly in the migrate_vma_* helpers. We are looking for feedback about this
> approach. If we're close, what's needed to make our patches acceptable 
> upstream?
> If we're not close, any suggestions how else to achieve what we are trying to 
> do
> (i.e. page migration and coherent CPU access to VRAM)?

Is there a way we can test the codepaths touched by this patchset?  It
doesn't have to be via a complete qemu simulation of the GPU device
memory, but some way of creating MEMORY_DEVICE_GENERIC subject to
migrate_vma_* helpers so we can test for regressions moving forward.

Thanks,

- Ted


Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

2021-05-12 Thread Theodore Ts'o
On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> v2:
> - removed EM/EN DASH conversion from this patchset;

Are you still thinking about doing the

EN DASH --> "--"
EM DASH --> "---"

conversion?  That's not going to change what the documentation will
look like in the HTML and PDF output forms, and I think it would make
life easier for people are reading and editing the Documentation/*
files in text form.

- Ted


Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

2021-05-10 Thread Theodore Ts'o
On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8. 
> > 
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> > 
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
> 
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.

I regularly use --- for em-dashes and -- for en-dashes.  Markdown will
automatically translate 3 ASCII hypens to em-dashes, and 2 ASCII
hyphens to en-dashes.  It's much, much easier for me to type 2 or 3
hypens into my text editor of choice than trying to enter the UTF-8
characters.  If we can make sphinx do this translation, maybe that's
the best way of dealing with these two characters?

Cheers,

- Ted


Re: [PATCH v5 00/18] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-06-21 Thread Theodore Ts'o
On Fri, Jun 21, 2019 at 08:59:48AM -0600, shuah wrote:
> > > ### But wait! Doesn't kselftest support in kernel testing?!
> > >
> > > 
> 
> I think I commented on this before. I agree with the statement that
> there is no overlap between Kselftest and KUnit. I would like see this
> removed. Kselftest module support supports use-cases KUnit won't be able
> to. I can build an kernel with Kselftest test modules and use it in the
> filed to load and run tests if I need to debug a problem and get data
> from a system. I can't do that with KUnit.
> 
> In my mind, I am not viewing this as which is better. Kselftest and
> KUnit both have their place in the kernel development process. It isn't
> productive and/or necessary to comparing Kselftest and KUnit without a
> good understanding of the problem spaces for each of these.
>
> I would strongly recommend not making reference to Kselftest and talk
> about what KUnit offers.

Shuah,

Just to recall the history, this section of the FAQ was added to rebut
the ***very*** strong statements that Frank made that there was
overlap between Kselftest and Kunit, and that having too many ways for
kernel developers to do the identical thing was harmful (he said it
was too much of a burden on a kernel developer) --- and this was an
argument for not including Kunit in the upstream kernel.

If we're past that objection, then perhaps this section can be
dropped, but there's a very good reason why it was there.  I wouldn't
Brendan to be accused of ignoring feedback from those who reviewed his
patches.   :-)

- Ted


Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-14 Thread Theodore Ts'o
On Tue, May 14, 2019 at 05:26:47PM -0700, Frank Rowand wrote:
> On 5/11/19 10:33 AM, Theodore Ts'o wrote:
> > On Fri, May 10, 2019 at 02:12:40PM -0700, Frank Rowand wrote:
> >> However, the reply is incorrect.  Kselftest in-kernel tests (which
> >> is the context here) can be configured as built in instead of as
> >> a module, and built in a UML kernel.  The UML kernel can boot,
> >> running the in-kernel tests before UML attempts to invoke the
> >> init process.
> > 
> > Um, Citation needed?
> 
> The paragraph that you quoted tells you exactly how to run a kselftest
> in-kernel test in a UML kernel.  Just to what that paragraph says.

I didn't quote a paragraph.  But I'll quote from it now:

  $ make -C tools/testing/selftests run_tests

This runs the kselftest harness, *in userspace*.  That means you have
to have a root file system, and it's run after init has started, by
default.  You asserted that kselftests allows you to run modules
before init has started.  There is absolutely zero, cero, nada, zilch
mentions of any of anything like that in Documentation/dev-tools/kselftests.rst

> > There exists test modules in the kernel that run before the init
> > scripts run --- but that's not strictly speaking part of kselftests,
> > and do not have any kind of infrastructure.  As noted, the
> > kselftests_harness header file fundamentally assumes that you are
> > running test code in userspace.
> 
> You are ignoring the kselftest in-kernel tests.

I'm talking specifically about what you have been *claiming* to be
kselftest in-kernel tests above.  And I'm asserting they are really
not kselftests.  They are just ad hoc tests that are run in kernel
space, which, when compiled as modules, can be loaded by a kselftest
shell script.  You can certainly hook in these ad hoc in-kernel tests
via kselftests --- but then they aren't run before init starts,
because kselftests is inherently a userspace-driven system.

If you build these tests (many of which existed before kselftests was
merged) into the kernel such that they are run before init starts,
without the kselftest harness, then they are not kselftests, by
definition.  Both in how they are run, and since many of these
in-kernel tests predate the introduction of kselftests --- in some
cases, by many years.

> We are talking in circles.  I'm done with this thread.

Yes, that sounds like it would be best.

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-11 Thread Theodore Ts'o
On Fri, May 10, 2019 at 02:12:40PM -0700, Frank Rowand wrote:
> However, the reply is incorrect.  Kselftest in-kernel tests (which
> is the context here) can be configured as built in instead of as
> a module, and built in a UML kernel.  The UML kernel can boot,
> running the in-kernel tests before UML attempts to invoke the
> init process.

Um, Citation needed?

I don't see any evidence for this in the kselftest documentation, nor
do I see any evidence of this in the kselftest Makefiles.

There exists test modules in the kernel that run before the init
scripts run --- but that's not strictly speaking part of kselftests,
and do not have any kind of infrastructure.  As noted, the
kselftests_harness header file fundamentally assumes that you are
running test code in userspace.

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-10 Thread Theodore Ts'o
On Thu, May 09, 2019 at 10:11:01PM -0700, Frank Rowand wrote:
> >> You *can* run in-kernel test using modules; but there is no framework
> >> for the in-kernel code found in the test modules, which means each of
> >> the in-kernel code has to create their own in-kernel test
> >> infrastructure.
> 
> The kselftest in-kernel tests follow a common pattern.  As such, there
> is a framework.

So we may have different definitions of "framework".  In my book, code
reuse by "cut and paste" does not make a framework.  Could they be
rewritten to *use* a framework, whether it be KTF or KUnit?  Sure!
But they are not using a framework *today*.

> This next two paragraphs you ignored entirely in your reply:
> 
> > Why create an entire new subsystem (KUnit) when you can add a header
> > file (and .c code as appropriate) that outputs the proper TAP formatted
> > results from kselftest kernel test modules?

And you keep ignoring my main observation, which is that spinning up a
VM, letting systemd start, mounting a root file system, etc., is all
unnecessary overhead which takes time.  This is important to me,
because developer velocity is extremely important if you are doing
test driven development.

Yes, you can manually unload a module, recompile the module, somehow
get the module back into the VM (perhaps by using virtio-9p), and then
reloading the module with the in-kernel test code, and the restart the
test.  BUT: (a) even if it is faster, it requires a lot of manual
steps, and would be very hard to automate, and (b) if the test code
ever OOPS or triggers a lockdep warning, you will need to restart the
VM, and so this involves all of the VM restart overhead, plus trying
to automate determining when you actually do need to restart the VM
versus unloading and reloading the module.   It's clunky.

Being able to do the equivalent of "make && make check" is a really
big deal.  And "make check" needs to go fast.

You keep ignore this point, perhaps because you don't care about this
issue?  Which is fine, and why we may just need to agree to disagree.

Cheers,

- Ted

P.S.  Running scripts is Turing-equivalent, so it's self-evident that
*anything* you can do with other test frameworks you can somehow do in
kselftests.  That argument doesn't impress me, and why I do consider
it quite flippant.  (Heck, /bin/vi is Turing equivalent so we could
use vi to as a kernel test framework.  Or we could use emacs.  Let's
not.  :-)

The question is whether it is the most best and most efficient way to
do that testing.  And developer velocity is a really big part of my
evaluation function when judging whether or a test framework is fit
for that purpose.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-09 Thread Theodore Ts'o
On Thu, May 09, 2019 at 05:40:48PM -0600, Logan Gunthorpe wrote:
> 
> Based on some of the other commenters, I was under the impression that
> kselftests had in-kernel tests but I'm not sure where or if they exist. If
> they do exists, it seems like it would make sense to convert those to kunit
> and have Kunit tests run-able in a VM or baremetal instance.

There are kselftests tests which are shell scripts which load a
module, and the module runs the in-kernel code.  However, I didn't see
much infrastructure for the in-kernel test code; the one or two test
modules called from kselftests looked pretty ad hoc to me.

That's why I used the "vise grips" analogy.  You can use a pair of
vise grips like a monkey wrench; but it's not really a monkey wrench,
and might not be the best tool to loosen or tighten nuts and bolts.

   - Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-09 Thread Theodore Ts'o
On Thu, May 09, 2019 at 04:20:05PM -0600, Logan Gunthorpe wrote:
> 
> The second item, arguably, does have significant overlap with kselftest.
> Whether you are running short tests in a light weight UML environment or
> higher level tests in an heavier VM the two could be using the same
> framework for writing or defining in-kernel tests. It *may* also be valuable
> for some people to be able to run all the UML tests in the heavy VM
> environment along side other higher level tests.
> 
> Looking at the selftests tree in the repo, we already have similar items to
> what Kunit is adding as I described in point (2) above. kselftest_harness.h
> contains macros like EXPECT_* and ASSERT_* with very similar intentions to
> the new KUNIT_EXECPT_* and KUNIT_ASSERT_* macros.
> 
> However, the number of users of this harness appears to be quite small. Most
> of the code in the selftests tree seems to be a random mismash of scripts
> and userspace code so it's not hard to see it as something completely
> different from the new Kunit:
> 
> $ git grep --files-with-matches kselftest_harness.h *

To the extent that we can unify how tests are written, I agree that
this would be a good thing.  However, you should note that
kselftest_harness.h is currently assums that it will be included in
userspace programs.  This is most obviously seen if you look closely
at the functions defined in the header files which makes calls to
fork(), abort() and fprintf().

So Kunit can't reuse kselftest_harness.h unmodified.  And whether or
not the actual implementation of the header file can be reused or
refactored, making the unit tests use the same or similar syntax would
be a good thing.

Cheers,

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-09 Thread Theodore Ts'o
On Thu, May 09, 2019 at 11:12:12AM -0700, Frank Rowand wrote:
> 
>"My understanding is that the intent of KUnit is to avoid booting a kernel 
> on
>real hardware or in a virtual machine.  That seems to be a matter of 
> semantics
>to me because isn't invoking a UML Linux just running the Linux kernel in
>a different form of virtualization?
> 
>So I do not understand why KUnit is an improvement over kselftest.
> 
>...
> 
>What am I missing?"

One major difference: kselftest requires a userspace environment; it
starts systemd, requires a root file system from which you can load
modules, etc.  Kunit doesn't require a root file system; doesn't
require that you start systemd; doesn't allow you to run arbitrary
perl, python, bash, etc. scripts.  As such, it's much lighter weight
than kselftest, and will have much less overhead before you can start
running tests.  So it's not really the same kind of virtualization.

Does this help?

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-09 Thread Theodore Ts'o
On Thu, May 09, 2019 at 01:52:15PM +0200, Knut Omang wrote:
> 1) Tests that exercises typically algorithmic or intricate, complex
>code with relatively few outside dependencies, or where the dependencies 
>are considered worth mocking, such as the basics of container data 
>structures or page table code. If I get you right, Ted, the tests 
>you refer to in this thread are such tests. I believe covering this space 
>is the goal Brendan has in mind for KUnit.

Yes, that's correct.  I'd also add that one of the key differences is
that it sounds like Frank and you are coming from the perspective of
testing *device drivers* where in general there aren't a lot of
complex code which is hardware independent.  After all, the vast
majority of device drivers are primarily interface code to hardware,
with as much as possible abstracted away to common code.  (Take, for
example, the model of the SCSI layer; or all of the kobject code.)

> 2) Tests that exercises interaction between a module under test and other 
>parts of the kernel, such as testing intricacies of the interaction of 
>a driver or file system with the rest of the kernel, and with hardware, 
>whether that is real hardware or a model/emulation. 
>Using your testing needs as example again, Ted, from my shallow 
> understanding,
>you have such needs within the context of xfstests 
> (https://github.com/tytso/xfstests)

Well, upstream is for xfstests is 
git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

The test framework where I can run 20 hours worth of xfstests
(multiple file system features enabled, multiple mount options, etc.)
in 3 hours of wall clock time using multiple cloud VM is something
called gce-xfstests.

I also have kvm-xfstests, which optimizes low test latency, where I
want to run a one or a small number of tests with a minimum of
overhead --- gce startup and shutdown is around 2 minutes, where as
kvm startup and shutdown is about 7 seconds.  As far as I'm concerned,
7 seconds is still too slow, but that's the best I've been able to do
given all of the other things I want a test framework to do, including
archiving test results, parsing the test results so it's easy to
interpret, etc.  Both kvm-xfstests and gce-xfstests are located at:

git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

So if Frank's primary argument is "too many frameworks", it's already
too late.  The block layer has blktests has a seprate framework,
called blktests --- and yeah, it's a bit painful to launch or learn
how to set things up.

That's why I added support to run blktests using gce-xfstests and
kvm-xfstests, so that "gce-xfstests --blktests" or "kvm-xfstests
--xfstests" will pluck a kernel from your build tree, and launch at
test appliance VM using that kernel and run the block layer tests.

The point is we *already* have multiple test frameworks, which are
optimized for testing different parts of the kernel.  And if you plan
to do a lot of work in these parts of the kernel, you're going to have
to learn how to use some other test framework other than kselftest.
Sorry, that's just the way it goes.

Of course, I'll accept trivial patches that haven't been tested using
xfstests --- but that's because I can trivially run the smoke test for
you.  Of course, if I get a lot of patches from a contributor which
cause test regressions, I'll treat them much like someone who
contribute patches which fail to build.  I'll apply pressure to the
contributor to actually build test, or run a ten minute kvm-xfstests
smoke test.  Part of the reason why I feel comfortable to do this is
it's really easy to run the smoke test.  There are pre-compiled test
appliances, and a lot of documentation:

https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

This is why I have close to zero sympathy to Frank's complaint that
extra test frameworks are a bad thing.  To me, that's whining.  I've
done a huge amount of work to meet contributors more than half-way.
The insistence that "There Must Be One", ala the Highlander movie, is
IMHO so wrong that it's not even close.  Is it really that hard to do
a "git pull", download a test appliance, set up a config file to tell
kvm-xfstests where to find your build tree, and then run "kvm-xfstests
--smoke" or "gce-xfstests --smoke"?  Cry me a river.

There are already multiple test frameworks, and if you expect to do a
lot of work in a particular subsystem, you'll be expected to use the
Maintainer's choice of tests.  Deal with it.  We do this so we can
scale to the number of contributors we have in our subsystem.

> To 1) I agree with Frank in that the problem with using UML is that you still 
> have to
> relate to the complexity of a kernel run time system, while what you really 
> want for these
> types of tests is just to compile a couple of kernel source files in a normal 
> user land
> context, to allow the use of Valgrind and other user space tools on the code.

"Just 

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-08 Thread Theodore Ts'o
On Wed, May 08, 2019 at 07:13:59PM -0700, Frank Rowand wrote:
> > If you want to use vice grips as a hammer, screwdriver, monkey wrench,
> > etc.  there's nothing stopping you from doing that.  But it's not fair
> > to object to other people who might want to use better tools.
> > 
> > The reality is that we have a lot of testing tools.  It's not just
> > kselftests.  There is xfstests for file system code, blktests for
> > block layer tests, etc.   We use the right tool for the right job.
> 
> More specious arguments.

Well, *I* don't think they are specious; so I think we're going to
have to agree to disagree.

Cheers,

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-08 Thread Theodore Ts'o
On Wed, May 08, 2019 at 05:43:35PM -0700, Frank Rowand wrote:
> kselftest provides a mechanism for in-kernel tests via modules.  For
> example, see:
> 
>   tools/testing/selftests/vm/run_vmtests invokes:
> tools/testing/selftests/vm/test_vmalloc.sh
>   loads module:
> test_vmalloc
> (which is built from lib/test_vmalloc.c if CONFIG_TEST_VMALLOC)

The majority of the kselftests are implemented as userspace programs.

You *can* run in-kernel test using modules; but there is no framework
for the in-kernel code found in the test modules, which means each of
the in-kernel code has to create their own in-kernel test
infrastructure.  

That's much like saying you can use vice grips to turn a nut or
bolt-head.  You *can*, but it might be that using a monkey wrench
would be a much better tool that is much easier.

What would you say to a wood worker objecting that a toolbox should
contain a monkey wrench because he already knows how to use vise
grips, and his tiny brain shouldn't be forced to learn how to use a
wrench when he knows how to use a vise grip, which is a perfectly good
tool?

If you want to use vice grips as a hammer, screwdriver, monkey wrench,
etc.  there's nothing stopping you from doing that.  But it's not fair
to object to other people who might want to use better tools.

The reality is that we have a lot of testing tools.  It's not just
kselftests.  There is xfstests for file system code, blktests for
block layer tests, etc.   We use the right tool for the right job.

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-08 Thread Theodore Ts'o
On Wed, May 08, 2019 at 05:58:49PM -0700, Frank Rowand wrote:
> 
> If KUnit is added to the kernel, and a subsystem that I am submitting
> code for has chosen to use KUnit instead of kselftest, then yes, I do
> *have* to use KUnit if my submission needs to contain a test for the
> code unless I want to convince the maintainer that somehow my case
> is special and I prefer to use kselftest instead of KUnittest.

That's going to be between you and the maintainer.  Today, if you want
to submit a substantive change to xfs or ext4, you're going to be
asked to create test for that new feature using xfstests.  It doesn't
matter that xfstests isn't in the kernel --- if that's what is
required by the maintainer.

> > supposed to be a simple way to run a large number of small tests that
> > for specific small components in a system.
> 
> kselftest also supports running a subset of tests.  That subset of tests
> can also be a large number of small tests.  There is nothing inherent
> in KUnit vs kselftest in this regard, as far as I am aware.

The big difference is that kselftests are driven by a C program that
runs in userspace.  Take a look at 
tools/testing/selftests/filesystem/dnotify_test.c
it has a main(int argc, char *argv) function.

In contrast, KUnit are fragments of C code which run in the kernel;
not in userspace.  This allows us to test internal functions inside
complex file system (such as the block allocator in ext4) directly.
This makes it *fundamentally* different from kselftest.

Cheers,

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-07 Thread Theodore Ts'o
On Tue, May 07, 2019 at 10:01:19AM +0200, Greg KH wrote:
> > My understanding is that the intent of KUnit is to avoid booting a kernel on
> > real hardware or in a virtual machine.  That seems to be a matter of 
> > semantics
> > to me because isn't invoking a UML Linux just running the Linux kernel in
> > a different form of virtualization?
> > 
> > So I do not understand why KUnit is an improvement over kselftest.
> > 
> > It seems to me that KUnit is just another piece of infrastructure that I
> > am going to have to be familiar with as a kernel developer.  More overhead,
> > more information to stuff into my tiny little brain.
> > 
> > I would guess that some developers will focus on just one of the two test
> > environments (and some will focus on both), splitting the development
> > resources instead of pooling them on a common infrastructure.
> > 
> > What am I missing?
> 
> kselftest provides no in-kernel framework for testing kernel code
> specifically.  That should be what kunit provides, an "easy" way to
> write in-kernel tests for things.
> 
> Brendan, did I get it right?

Yes, that's basically right.  You don't *have* to use KUnit.  It's
supposed to be a simple way to run a large number of small tests that
for specific small components in a system.

For example, I currently use xfstests using KVM and GCE to test all of
ext4.  These tests require using multiple 5 GB and 20GB virtual disks,
and it works by mounting ext4 file systems and exercising ext4 through
the system call interfaces, using userspace tools such as fsstress,
fsx, fio, etc.  It requires time overhead to start the VM, create and
allocate virtual disks, etc.  For example, to run a single 3 seconds
xfstest (generic/001), it requires full 10 seconds to run it via
kvm-xfstests.

KUnit is something else; it's specifically intended to allow you to
create lightweight tests quickly and easily, and by reducing the
effort needed to write and run unit tests, hopefully we'll have a lot
more of them and thus improve kernel quality.

As an example, I have a volunteer working on developing KUinit tests
for ext4.  We're going to start by testing the ext4 extent status
tree.  The source code is at fs/ext4/extent_status.c; it's
approximately 1800 LOC.  The Kunit tests for the extent status tree
will exercise all of the corner cases for the various extent status
tree functions --- e.g., ext4_es_insert_delayed_block(),
ext4_es_remove_extent(), ext4_es_cache_extent(), etc.  And it will do
this in isolation without our needing to create a test file system or
using a test block device.

Next we'll test the ext4 block allocator, again in isolation.  To test
the block allocator we will have to write "mock functions" which
simulate reading allocation bitmaps from disk.  Again, this will allow
the test writer to explicitly construct corner cases and validate that
the block allocator works as expected without having to reverese
engineer file system data structures which will force a particular
code path to be executed.

So this is why it's largely irrelevant to me that KUinit uses UML.  In
fact, it's a feature.  We're not testing device drivers, or the
scheduler, or anything else architecture-specific.  UML is not about
virtualization.  What it's about in this context is allowing us to
start running test code as quickly as possible.  Booting KVM takes
about 3-4 seconds, and this includes initializing virtio_scsi and
other device drivers.  If by using UML we can hold the amount of
unnecessary kernel subsystem initialization down to the absolute
minimum, and if it means that we can communicating to the test
framework via a userspace "printf" from UML/KUnit code, as opposed to
via a virtual serial port to KVM's virtual console, it all makes for
lighter weight testing.

Why did I go looking for a volunteer to write KUnit tests for ext4?
Well, I have a plan to make some changes in restructing how ext4's
write path works, in order to support things like copy-on-write, a
more efficient delayed allocation system, etc.  This will require
making changes to the extent status tree, and by having unit tests for
the extent status tree, we'll be able to detect any bugs that we might
accidentally introduce in the es tree far more quickly than if we
didn't have those tests available.  Google has long found that having
these sorts of unit tests is a real win for developer velocity for any
non-trivial code module (or C++ class), even when you take into
account the time it takes to create the unit tests.

- Ted

P.S.  Many thanks to Brendan for finding such a volunteer for me; the
person in question is a SRE from Switzerland who is interested in
getting involved with kernel testing, and this is going to be their
20% project.  :-)

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

[git pull] drm fixes for 4.9-rc4

2016-11-04 Thread Theodore Ts'o
On Fri, Nov 04, 2016 at 01:38:25PM -0700, Linus Torvalds wrote:
> On Wed, Nov 2, 2016 at 5:31 PM, Dave Airlie  wrote:
> >
> > There are a set of fixes for an oops we've been seeing around MST
> > display unplug,
> 
> Side note: I heard from a couple of people at the KS that said that
> they had had problems with suspend/resume after plugging in to a 4k
> display (but _only_ a 4k display - apparently normal FHD displays
> didn't show this). I think at least one was USB3/Thunderbolt. Ted with
> a Lenovo laptop (intel GPU) was one, I forget who else mentioned this.

Actually, it's after a unplugging from a Dell 30" monitor with a 3k
display (2560 x 1920).  This is after I've carefully deactivated the
video output to the Dell 30" monitor, unplugged the Dell 30" monitor
(at which point the system becomes non-responsive for 2-3 seconds for
reasons unknown), and only suspending after the system has recovered
from the unplug.

At that point, it's a 20-30% chance that the system will never come
back after a suspend.  So I have to make a point of saving all of my
editor buffers, etc., since I never can know whether my laptop will
come back.

This was happening for years and years on the T540p laptop, as well as
my new T460 laptop.  I've complained about this in the past, and
gotten no response, and I've just gotten used to the fact that if I'm
transitioning from home (where I have the 30" display) to work,
there's a good chance the resume will lock up, and I will be forced to
push the power button for 8 seconds to forcibly power down the laptop
to recover from the suspend.  :-(

I agree with Linus's suspicion that I probably need to bite the bullet
and just buy a new SST monitor, and that will probably make the
problem go away.  But if the bug can be fixed, that would be really
great.

Thanks,

- Ted


[REGRESSION] Re: i915 driver crashes on T540p if docking station attached

2015-08-03 Thread Theodore Ts'o
On Mon, Aug 03, 2015 at 10:24:53AM -0700, Linus Torvalds wrote:
> On Mon, Aug 3, 2015 at 9:25 AM, Theodore Ts'o  wrote:
> >
> > I've just tried pulling in your updated fixes-stuff, and it avoids the
> > oops and allows external the monitor to work correctly.
> 
> Good. Have either of you tested the suspend/resume behavior? Is that fixed 
> too?

No, I haven't had a chance to test the suspend/resume behavior,
because that requires suspending at work, going home, and connecting
to a dock which has a different monitor attached to it, and resuming
(or vice versa of suspending at home and then resuming at work).

So it's a bit trickier for me to test.  It's also not a regression,
and the workaround of rebooting is annoying, but I've lived with it
for several releases now, but I'll try the two patches/changes that
folks had suggested hopefully later this week.

- Ted


[REGRESSION] Re: i915 driver crashes on T540p if docking station attached

2015-08-03 Thread Theodore Ts'o
On Mon, Aug 03, 2015 at 05:27:29PM +0200, Daniel Vetter wrote:
> 
> Ok I updated fixes-stuff with just 2 patches which seem to be enough to
> fix it. Plus a patch to convert Linus' hack into something we can keep
> plus a drive-by WARNING fix in mst that got in the way for me.
> 
> Seems to work here in getting rid of the Oops. If this tests out for you
> too I'll send a pull to Linus.

I've just tried pulling in your updated fixes-stuff, and it avoids the
oops and allows external the monitor to work correctly.  However, I'm
still seeing a large number of drm/i915 related warning messages and
other kernel kvetching.

Thanks!!

- Ted

[4.084198] [drm] Initialized drm 1.1.0 20060810
[4.129576] [drm] Memory usable by graphics device = 2048M
[4.129616] [drm] Replacing VGA console driver
[4.130315] Console: switching to colour dummy device 80x25
[4.145332] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[4.145334] [drm] Driver supports precise vblank timestamp query.
[4.146184] vgaarb: device changed decodes: 
PCI::00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[4.163778] usbcore: registered new interface driver btusb
[4.170719] [ cut here ]
[4.170749] WARNING: CPU: 0 PID: 463 at 
/usr/projects/linux/linux/drivers/gpu/drm/i915/intel_pm.c:2339 
ilk_update_wm+0x71a/0xb27 [i915]()
[4.170751] WARN_ON(!r->enable)
[4.170752] Modules linked in:
[4.170754]  btusb btrtl btbcm btintel iwlmvm(+) bluetooth mac80211 iwlwifi 
snd_hda_intel i915(+) drm_kms_helper snd_hda_codec cfg80211 drm snd_hwdep 
lpc_ich snd_hda_core intel_gtt thinkpad_acpi tpm_tis nvram tpm 
intel_smartconnect uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core 
sch_fq_codel kvm_intel kvm ecryptfs parport_pc ppdev lp parport autofs4 btrfs 
xor hid_generic usbhid hid raid6_pq microcode rtsx_pci_sdmmc ehci_pci e1000e 
rtsx_pci ehci_hcd xhci_pci ptp mfd_core pps_core xhci_hcd
[4.170805] CPU: 0 PID: 463 Comm: systemd-udevd Not tainted 
4.2.0-rc5-14194-g130583b #18
[4.170807] Hardware name: LENOVO 20BECTO1WW/20BECTO1WW, BIOS GMET59WW (2.07 
) 02/12/2014
[4.170809]  0009 880403f0f4c8 8161aaee 
0006
[4.170814]  880403f0f518 880403f0f508 8107e5f0 
0006
[4.170818]  c05ade43 8800c8b7 8800c7f16000 
880405fb48b8
[4.170823] Call Trace:
[4.170829]  [] dump_stack+0x4c/0x65
[4.170833]  [] warn_slowpath_common+0xa1/0xbb
[4.170856]  [] ? ilk_update_wm+0x71a/0xb27 [i915]
[4.170859]  [] warn_slowpath_fmt+0x46/0x48
[4.170879]  [] ? ilk_compute_wm_maximums+0x43/0xa2 [i915]
[4.170899]  [] ilk_update_wm+0x71a/0xb27 [i915]
[4.170921]  [] intel_update_watermarks+0x1e/0x20 [i915]
[4.170957]  [] haswell_crtc_disable+0x270/0x2ae [i915]
[4.170989]  [] intel_crtc_control+0xa0/0xe1 [i915]
[4.171020]  [] intel_crtc_update_dpms+0x4d/0x5d [i915]
[4.171052]  [] intel_modeset_setup_hw_state+0x7b0/0xa90 
[i915]
[4.171081]  [] ? hsw_write64+0xcd/0xcd [i915]
[4.171113]  [] ? ilk_fbc_disable+0x29/0x69 [i915]
[4.171142]  [] intel_modeset_init+0x130d/0x14e3 [i915]
[4.171179]  [] i915_driver_load+0xf05/0x1139 [i915]
[4.171183]  [] ? mark_held_locks+0x56/0x6c
[4.171186]  [] ? _raw_spin_unlock_irqrestore+0x3f/0x4d
[4.171189]  [] ? trace_hardirqs_on_caller+0x171/0x18d
[4.171204]  [] drm_dev_register+0x84/0xfd [drm]
[4.171215]  [] drm_get_pci_dev+0x102/0x1bc [drm]
[4.171237]  [] i915_pci_probe+0x4f/0x51 [i915]
[4.171240]  [] pci_device_probe+0x74/0xd6
[4.171245]  [] ? driver_probe_device+0x387/0x387
[4.171248]  [] driver_probe_device+0x15f/0x387
[4.171250]  [] ? driver_probe_device+0x387/0x387
[4.171252]  [] __driver_attach+0x53/0x74
[4.171255]  [] bus_for_each_dev+0x6f/0x89
[4.171257]  [] driver_attach+0x1e/0x20
[4.171260]  [] bus_add_driver+0x140/0x238
[4.171263]  [] driver_register+0x8f/0xcc
[4.171266]  [] __pci_register_driver+0x5e/0x62
[4.171268]  [] ? 0xc069c000
[4.171278]  [] drm_pci_init+0x58/0xda [drm]
[4.171281]  [] ? 0xc069c000
[4.171301]  [] i915_init+0xa0/0xa8 [i915]
[4.171303]  [] ? 0xc069c000
[4.171307]  [] do_one_initcall+0x19a/0x1af
[4.171310]  [] ? do_init_module+0x28/0x1e3
[4.171313]  [] ? kmem_cache_alloc_trace+0xba/0xcc
[4.171315]  [] do_init_module+0x60/0x1e3
[4.171319]  [] load_module+0x1c42/0x2059
[4.171324]  [] SyS_finit_module+0x85/0x92
[4.171327]  [] entry_SYSCALL_64_fastpath+0x16/0x73
[4.171329] ---[ end trace 7eb514b89de5fc4a ]---
[4.171331] [ cut here ]
[4.171354] WARNING: CPU: 0 PID: 463 at 
/usr/projects/linux/linux/drivers/gpu/drm/i915/intel_pm.c:2339 
ilk_update_wm+0x71a/0xb27 [i915]()
[4.171355] WARN_ON(!r->enable)
[4.171357] Modules linked in:
[4.171358]  btusb 

[REGRESSION] Re: i915 driver crashes on T540p if docking station attached

2015-07-30 Thread Theodore Ts'o
On Thu, Jul 30, 2015 at 11:50:29AM -0400, Theodore Ts'o wrote:
> I've tried pulling in your patches from fixes-stuff, onto Linus's tree
> (without Linus's fix), and the good news is that I'm no longer
> crashing on boot.
> 
> The *bad* news is that (a) it breaks the external monitor attached to
> the docking station completely (this was working with Linus's patch),
> and (b) it's triggering a LOCKDEP failure.

Well, that's not fair.  Even with Linus's fix, there is still a
LOCKDEP failure.  And a few more i915 WARNINGS.  But at least the
external monitor works, so this is what I'm using.  Enclosed please
find a dmesg with the lockdep and i915 warnings and my .config.  The
kernel that I used can be found at:

https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=i915-test-4.2.0-rc4

- Ted
-- next part --
A non-text attachment was scrubbed...
Name: dmesg.gz
Type: application/gzip
Size: 25784 bytes
Desc: not available
URL: 
<http://lists.freedesktop.org/archives/dri-devel/attachments/20150730/5b475051/attachment-0002.bin>
-- next part --
A non-text attachment was scrubbed...
Name: config.gz
Type: application/gzip
Size: 31022 bytes
Desc: not available
URL: 
<http://lists.freedesktop.org/archives/dri-devel/attachments/20150730/5b475051/attachment-0003.bin>


[REGRESSION] Re: i915 driver crashes on T540p if docking station attached

2015-07-30 Thread Theodore Ts'o
On Thu, Jul 30, 2015 at 04:40:02PM +0200, Daniel Vetter wrote:
> I have 4 patches in git://people.freedesktop.org/~danvet/drm fixes-stuff
> but I couldn't test them yet since no dp mst here and I didn't find
> anything that would ship faster than 1-2 weeks yet. I'll try to get some
> other people here to test it meanwhile too.

I've tried pulling in your patches from fixes-stuff, onto Linus's tree
(without Linus's fix), and the good news is that I'm no longer
crashing on boot.

The *bad* news is that (a) it breaks the external monitor attached to
the docking station completely (this was working with Linus's patch),
and (b) it's triggering a LOCKDEP failure.

So even though Linus's patch wasn't supposed to work, I think I'm
going to back to it

- Ted


Jul 30 11:46:49 closure kernel: [4.221951] 
Jul 30 11:46:49 closure kernel: [4.221954] 
==
Jul 30 11:46:49 closure kernel: [4.221957] [ INFO: possible circular 
locking dependency detected ]
Jul 30 11:46:49 closure kernel: [4.221960] 4.2.0-rc4-13906-g5f1b75cd #16 
Not tainted
Jul 30 11:46:49 closure kernel: [4.221963] 
---
Jul 30 11:46:49 closure kernel: [4.221966] modprobe/503 is trying to 
acquire lock:
Jul 30 11:46:49 closure kernel: [4.221968]  (init_mutex){+.+.+.}, at: 
[] acpi_video_get_backlight_type+0x17/0x164
Jul 30 11:46:49 closure kernel: [4.221977] 
Jul 30 11:46:49 closure kernel: [4.221977] but task is already holding lock:
Jul 30 11:46:49 closure kernel: [4.221979]  
(&(_notifier)->rwsem){..}, at: [] 
__blocking_notifier_call_chain+0x37/0x69
Jul 30 11:46:49 closure kernel: [4.221987] 
Jul 30 11:46:49 closure kernel: [4.221987] which lock already depends on 
the new lock.
Jul 30 11:46:49 closure kernel: [4.221987] 
Jul 30 11:46:49 closure kernel: [4.221990] 
Jul 30 11:46:49 closure kernel: [4.221990] the existing dependency chain 
(in reverse order) is:
Jul 30 11:46:49 closure kernel: [4.221995] 
Jul 30 11:46:49 closure kernel: [4.221995] -> #1 
(&(_notifier)->rwsem){..}:
Jul 30 11:46:49 closure kernel: [4.222001][] 
lock_acquire+0x104/0x18b
Jul 30 11:46:49 closure kernel: [4.222007][] 
down_write+0x46/0x8a
Jul 30 11:46:49 closure kernel: [4.222012][] 
blocking_notifier_chain_register+0x36/0x57
Jul 30 11:46:49 closure kernel: [4.222017][] 
backlight_register_notifier+0x18/0x1a
Jul 30 11:46:49 closure kernel: [4.222022][] 
acpi_video_get_backlight_type+0xfa/0x164
Jul 30 11:46:49 closure kernel: [4.222028][] 
0xc03a1e45
Jul 30 11:46:49 closure audispd: No plugins found, exiting
Jul 30 11:46:49 closure kernel: [4.222032][] 
0xc03a28a8
Jul 30 11:46:49 closure kernel: [4.222036][] 
do_one_initcall+0x19a/0x1af
Jul 30 11:46:49 closure kernel: [4.222042][] 
do_init_module+0x60/0x1e3
Jul 30 11:46:49 closure kernel: [4.222047][] 
load_module+0x1c42/0x2059
Jul 30 11:46:49 closure kernel: [4.222052][] 
SyS_finit_module+0x85/0x92
Jul 30 11:46:49 closure kernel: [4.222056][] 
entry_SYSCALL_64_fastpath+0x16/0x73
Jul 30 11:46:49 closure kernel: [4.222060] 
Jul 30 11:46:49 closure kernel: [4.222060] -> #0 (init_mutex){+.+.+.}:
Jul 30 11:46:49 closure kernel: [4.222065][] 
__lock_acquire+0xc55/0xf54
Jul 30 11:46:49 closure kernel: [4.222070][] 
lock_acquire+0x104/0x18b
Jul 30 11:46:49 closure kernel: [4.222074][] 
mutex_lock_nested+0x70/0x391
Jul 30 11:46:49 closure kernel: [4.222078][] 
acpi_video_get_backlight_type+0x17/0x164
Jul 30 11:46:49 closure kernel: [4.222083][] 
acpi_video_backlight_notify+0x19/0x2f
Jul 30 11:46:49 closure kernel: [4.222088][] 
notifier_call_chain+0x4c/0x71
Jul 30 11:46:49 closure kernel: [4.222092][] 
__blocking_notifier_call_chain+0x50/0x69
Jul 30 11:46:49 closure kernel: [4.222098][] 
blocking_notifier_call_chain+0x14/0x16
Jul 30 11:46:49 closure kernel: [4.222103][] 
backlight_device_register+0x1df/0x1f1
Jul 30 11:46:49 closure kernel: [4.222108][] 
intel_backlight_register+0xf0/0x157 [i915]
Jul 30 11:46:49 closure kernel: [4.222146][] 
intel_modeset_gem_init+0x158/0x164 [i915]
Jul 30 11:46:49 closure kernel: [4.222176][] 
i915_driver_load+0xf1c/0x1139 [i915]
Jul 30 11:46:49 closure kernel: [4.05][] 
drm_dev_register+0x84/0xfd [drm]
Jul 30 11:46:49 closure kernel: [4.17][] 
drm_get_pci_dev+0x102/0x1bc [drm]
Jul 30 11:46:49 closure kernel: [4.28][] 
i915_pci_probe+0x4f/0x51 [i915]
Jul 30 11:46:49 closure kernel: [4.47][] 
pci_device_probe+0x74/0xd6
Jul 30 11:46:49 closure kernel: [4.53][] 
driver_probe_device+0x15f/0x387
Jul 30 

[REGRESSION] Re: i915 driver crashes on T540p if docking station attached

2015-07-30 Thread Theodore Ts'o
On Thu, Jul 30, 2015 at 04:40:02PM +0200, Daniel Vetter wrote:
> On Wed, Jul 29, 2015 at 10:18:16PM -0700, Linus Torvalds wrote:
> >  drivers/gpu/drm/drm_atomic_helper.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_atomic_helper.c 
> > b/drivers/gpu/drm/drm_atomic_helper.c
> > index 5b59d5ad7d1c..aac212297b49 100644
> > --- a/drivers/gpu/drm/drm_atomic_helper.c
> > +++ b/drivers/gpu/drm/drm_atomic_helper.c
> > @@ -230,10 +230,12 @@ update_connector_routing(struct drm_atomic_state 
> > *state, int conn_idx)
> > }
> >  
> > connector_state->best_encoder = new_encoder;
> > -   idx = drm_crtc_index(connector_state->crtc);
> > +   if (connector_state->crtc) {
> > +   idx = drm_crtc_index(connector_state->crtc);
> >  
> > -   crtc_state = state->crtc_states[idx];
> > -   crtc_state->mode_changed = true;
> > +   crtc_state = state->crtc_states[idx];
> > +   crtc_state->mode_changed = true;
> > +   }
> 
> This shouldn't happen since if it does we ended up stealing the encoder
> from the connector itself (we do check for connector_state->crtc earlier)
> and that would be a bug. I haven't figured out a precise theory but my
> guess is on the best_encoder selection, and indeed dp mst encoder
> selection seems to have gone belly up in 4.2 with the bisected commit.

Well, I just tested Linus's patch and it works.

BTW, is there any chance that I can suspend my laptop, and then move
it from my docking station at home (where I have a Dell 30" display)
to my docking station at work (where I have a Dell 24" display), and
actually have the new monitor be detected?  For at least the past
year, I have to reboot in order to be able to use the external
monitor?  This used to work, but it's been a very long-standing
regression.  I undrstand that Multi-stream DP is a evil horrible hack,
and supporting it is painful, but this used to work, and it hasn't in
a long time.  :-(

- Ted


[REGRESSION] Re: i915 driver crashes on T540p if docking station attached

2015-07-29 Thread Theodore Ts'o
On Wed, Jul 29, 2015 at 08:49:37PM -0400, Theodore Ts'o wrote:
> 
> Unfortunately the failure causes a series of recursive faults and I
> haven't been able to capture the stack trace, but on 4.2-rcX kernels,
> I can reliably cause the system to crash if my T540p is booted with
> the docking station attached.
> 
> It will also crash if I boot the system first, and then insert the
> laptop into the dockstation.
> 
> Unfortunately, I can't get a stack trace because there are a huge
> number of recursive/double faults, and the system dies so quickly that
> nothing ends up in the log files.  If you really need a stack dump I
> can try to rig something, but modern Laptops don't have serial
> consoles any more, alas, so it's bit of a pain.

The bad news is that I tried to use kdump to capture a crashdump and
hopefully get more information, and kdump utterly wedged on the panic.
The good news is because it wedged the system, I was able to get the
console stackdump before it scrolled off due to a whole series of
recursive oops messages.

It's here:  https://goo.gl/photos/xHjn2Z97JQEw6k2C9

Hopefully tihs is useful.  It's not obvious how to revert this change,
since there were a large number of changes to i915 after this.  If
someone could help me with a revert, I'd be happy to test it.

Thanks,

- Ted



> 
> I was able to bisect it down to this commit, however: 8c7b5ccb72987:
> "drm/i915: Use atomic helpers for computing changed flags:"
> 
> Is there any chance Intel could add a Lenovo Dockstation with a
> Multistream DP output to part of your test hardware?  Unfortunately it
> seems pretty common that I see regressions with my particular
> hardware.  Maybe there aren't enough people using Thinkpads any more?  :-(
> 
> - Ted
> 
> 
> P.S.  The git bisect log
> 
> git bisect start
> # bad: [421d125c06c4be4c5005cb69840206bd09b71dd6] builddeb: sign the modules 
> after splitting out the debuginfo files
> git bisect bad 421d125c06c4be4c5005cb69840206bd09b71dd6
> # good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
> git bisect good b953c0d234bc72e8489d3bf51a276c5c4ec85345
> # good: [aeaa2122af4e53f3bfd28e8f294557bb95af43fc] drm/i915/skl: Add the INIT 
> power domain to the MISC I/O power well
> git bisect good aeaa2122af4e53f3bfd28e8f294557bb95af43fc
> # bad: [4d70f38a760ad2879d2ebd84001c92980180f630] drm/i915/bios: remove a 
> redundant NULL pointer check
> git bisect bad 4d70f38a760ad2879d2ebd84001c92980180f630
> # bad: [27a1b688d9f1fa2abd14bfe6a8729a19fb3b1b25] drm/i915/bxt: Enable 
> WaEnableYV12BugFixInHalfSliceChicken7 for Broxton
> git bisect bad 27a1b688d9f1fa2abd14bfe6a8729a19fb3b1b25
> # good: [4be0731786de10d0e9ae1d159504c83c6b052647] drm/i915: Add crtc states 
> before calling compute_config()
> git bisect good 4be0731786de10d0e9ae1d159504c83c6b052647
> # good: [d5432a9d19b61ba6a2b3d88f3026e0ca60eb57a1] drm/i915: Stage new 
> modeset state straight into atomic state
> git bisect good d5432a9d19b61ba6a2b3d88f3026e0ca60eb57a1
> # bad: [a821fc46bc7bb6d4cf9a5f8d2787fd70231c2c10] drm/i915: Swap atomic state 
> in legacy modeset
> git bisect bad a821fc46bc7bb6d4cf9a5f8d2787fd70231c2c10
> # bad: [8c7b5ccb729870e606321b3703e2c2e698c49a95] drm/i915: Use atomic 
> helpers for computing changed flags
> git bisect bad 8c7b5ccb729870e606321b3703e2c2e698c49a95
> # good: [0f63cca2afdc38877e86acfa9821020f6e2213fd] drm/i915: Update crtc 
> state active flag based on DPMS
> git bisect good 0f63cca2afdc38877e86acfa9821020f6e2213fd
> # good: [840bfe953384a134c8639f2964d9b74bfa671e16] drm/atomic: Make 
> mode_fixup() optional for check_modeset()
> git bisect good 840bfe953384a134c8639f2964d9b74bfa671e16
> # first bad commit: [8c7b5ccb729870e606321b3703e2c2e698c49a95] drm/i915: Use 
> atomic helpers for computing changed flags
> 


i915 driver crashes on T540p if docking station attached

2015-07-29 Thread Theodore Ts'o

Unfortunately the failure causes a series of recursive faults and I
haven't been able to capture the stack trace, but on 4.2-rcX kernels,
I can reliably cause the system to crash if my T540p is booted with
the docking station attached.

It will also crash if I boot the system first, and then insert the
laptop into the dockstation.

Unfortunately, I can't get a stack trace because there are a huge
number of recursive/double faults, and the system dies so quickly that
nothing ends up in the log files.  If you really need a stack dump I
can try to rig something, but modern Laptops don't have serial
consoles any more, alas, so it's bit of a pain.

I was able to bisect it down to this commit, however: 8c7b5ccb72987:
"drm/i915: Use atomic helpers for computing changed flags:"

Is there any chance Intel could add a Lenovo Dockstation with a
Multistream DP output to part of your test hardware?  Unfortunately it
seems pretty common that I see regressions with my particular
hardware.  Maybe there aren't enough people using Thinkpads any more?  :-(

  - Ted


P.S.  The git bisect log

git bisect start
# bad: [421d125c06c4be4c5005cb69840206bd09b71dd6] builddeb: sign the modules 
after splitting out the debuginfo files
git bisect bad 421d125c06c4be4c5005cb69840206bd09b71dd6
# good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
git bisect good b953c0d234bc72e8489d3bf51a276c5c4ec85345
# good: [aeaa2122af4e53f3bfd28e8f294557bb95af43fc] drm/i915/skl: Add the INIT 
power domain to the MISC I/O power well
git bisect good aeaa2122af4e53f3bfd28e8f294557bb95af43fc
# bad: [4d70f38a760ad2879d2ebd84001c92980180f630] drm/i915/bios: remove a 
redundant NULL pointer check
git bisect bad 4d70f38a760ad2879d2ebd84001c92980180f630
# bad: [27a1b688d9f1fa2abd14bfe6a8729a19fb3b1b25] drm/i915/bxt: Enable 
WaEnableYV12BugFixInHalfSliceChicken7 for Broxton
git bisect bad 27a1b688d9f1fa2abd14bfe6a8729a19fb3b1b25
# good: [4be0731786de10d0e9ae1d159504c83c6b052647] drm/i915: Add crtc states 
before calling compute_config()
git bisect good 4be0731786de10d0e9ae1d159504c83c6b052647
# good: [d5432a9d19b61ba6a2b3d88f3026e0ca60eb57a1] drm/i915: Stage new modeset 
state straight into atomic state
git bisect good d5432a9d19b61ba6a2b3d88f3026e0ca60eb57a1
# bad: [a821fc46bc7bb6d4cf9a5f8d2787fd70231c2c10] drm/i915: Swap atomic state 
in legacy modeset
git bisect bad a821fc46bc7bb6d4cf9a5f8d2787fd70231c2c10
# bad: [8c7b5ccb729870e606321b3703e2c2e698c49a95] drm/i915: Use atomic helpers 
for computing changed flags
git bisect bad 8c7b5ccb729870e606321b3703e2c2e698c49a95
# good: [0f63cca2afdc38877e86acfa9821020f6e2213fd] drm/i915: Update crtc state 
active flag based on DPMS
git bisect good 0f63cca2afdc38877e86acfa9821020f6e2213fd
# good: [840bfe953384a134c8639f2964d9b74bfa671e16] drm/atomic: Make 
mode_fixup() optional for check_modeset()
git bisect good 840bfe953384a134c8639f2964d9b74bfa671e16
# first bad commit: [8c7b5ccb729870e606321b3703e2c2e698c49a95] drm/i915: Use 
atomic helpers for computing changed flags



WARNING: /usr/projects/linux/linux/drivers/gpu/drm/i915/intel_pm.c:6585 intel_display_power_put+0x4b/0x116 [i915]()

2014-12-15 Thread Theodore Ts'o
This morning, I was docked and using a Dell 30" monitor.  I
reconfigured the X server to stop sending video to the external
monitor, suspended the laptop, and after it was suspended undocked it
and took it to work.  Then I docked it at work, where it was connected
to a powered off Dell 24" monitor.

Since this time there was a note:

[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including 
userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> 
DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not 
a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always 
attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error

I've filed:

https://bugs.freedesktop.org/show_bug.cgi?id=87327

I do want to note that this is only one of many different symptoms and
failures that I am seeing.

Basically, after suspending and resuming, if I use an external
monitor, I have to be prepared to restart the X server --- and
sometimes the kernel will sponaneously reboot, sometimes when I do
something as simple as accidentally brushing against the power switch
on the Dell montor.   :-(   :-(   :-(

- Ted


WARNING: /usr/projects/linux/linux/drivers/gpu/drm/i915/intel_pm.c:6585 intel_display_power_put+0x4b/0x116 [i915]()

2014-12-07 Thread Theodore Ts'o
On Mon, Dec 08, 2014 at 12:32:01PM +1000, Dave Airlie wrote:
> 
> I suspect a lot of the problems are just that xfce isn't sufficiently handling
> randr events, and it is getting out of sync, it is like hotplug networking
> before NetworkManager etc.

Yes, I've seen this on XFCE 4.11 (in Ubuntu), although I haen't seen
it in XFCE 4.10 (in Debian).  In 4.11, when xfce gets out of sync,
xrandr --auto will allow me to enable the display, even when
xfce4-display-settings does not.

However, on my T540p with the monitors connected via the dock, which
is the problem I was describing here, xrandr --auto does *not* fix
things up.  So I think it's a different problem, since I can see this
problem even if I don't use xfce4-display-settings, and just use
xrandr directly.

- Ted


WARNING: /usr/projects/linux/linux/drivers/gpu/drm/i915/intel_pm.c:6585 intel_display_power_put+0x4b/0x116 [i915]()

2014-12-07 Thread Theodore Ts'o
This is an update to a problem which I reported several months ago
(see below).  The symptoms have changed a bit since then, but they've
stablized since 3.17 and 3.18-rcX, and while annoying, it's tolerable,
so I've been living with it.

What I'm basically seeing now is that any external monitor (either a
Dell 30" or Dell 24") will be seen after a reboot or a restart of the
X server.  But if suspend the laptop, disconnect from the dock, and
resume, and then later on, reconnect to the dock, the external
monitors are not visible until I kill and restart the X server.
Another part of the symptom is when I try to probe for the monitors,
using either xrandr or xfce4-display-settings, the system freezes for
a second or two, and then when it recovers, if I look in the logs, I
see the following warning message, repeated twice:

[246289.614639] WARNING: CPU: 0 PID: 29875 at 
/usr/projects/linux/linux/drivers/gpu/drm/i915/intel_pm.c:6585 
intel_display_power_put+0x4b/0x116 [i915]()
[246289.614640] Modules linked in: iptable_filter ip_tables x_tables rfcomm ctr 
ccm binfmt_misc bnep hid_generic usbhid uvcvideo hid videobuf2_vmalloc 
videobuf2_memops videobuf2_core btusb bluetooth snd_hda_codec_hdmi 
x86_pkg_temp_thermal intel_powerclamp crc32_pclmul ghash_clmulni_intel pcspkr 
serio_raw i2c_i801 thinkpad_acpi nvram snd_hda_codec_realtek 
snd_hda_codec_generic tpm_tis iwlmvm mac80211 i915 drm_kms_helper drm iwlwifi 
intel_smartconnect intel_gtt snd_hda_intel cfg80211 lpc_ich mfd_core 
snd_hda_controller snd_hda_codec xhci_pci snd_hwdep xhci_hcd kvm_intel kvm 
ecryptfs encrypted_keys trusted tpm parport_pc ppdev lp parport autofs4 btrfs 
xor raid6_pq microcode ehci_pci ehci_hcd e1000e ptp pps_core
[246289.614675] CPU: 0 PID: 29875 Comm: Xorg Tainted: GW  
3.18.0-rc7-00028-g455e143 #108
[246289.614676] Hardware name: LENOVO 20BECTO1WW/20BECTO1WW, BIOS GMET59WW 
(2.07 ) 02/12/2014
[246289.614677]  0009 8802048e3a68 815cb640 
0006
[246289.614679]   8802048e3aa8 8106de00 
8802048e3aa8
[246289.614681]  a04a7441 880405610044 88040561 
880405438890
[246289.614683] Call Trace:
[246289.614689]  [] dump_stack+0x4e/0x68
[246289.614692]  [] warn_slowpath_common+0x81/0x9b
[246289.614702]  [] ? intel_display_power_put+0x4b/0x116 
[i915]
[246289.614704]  [] warn_slowpath_null+0x1a/0x1c
[246289.614713]  [] intel_display_power_put+0x4b/0x116 [i915]
[246289.614731]  [] 
modeset_update_crtc_power_domains+0xd8/0x117 [i915]
[246289.614746]  [] haswell_modeset_global_resources+0xe/0x10 
[i915]
[246289.614760]  [] __intel_set_mode+0x9a5/0x1204 [i915]
[246289.614775]  [] ? intel_crtc_set_config+0x151/0xa98 [i915]
[246289.614789]  [] intel_set_mode+0x16/0x2f [i915]
[246289.614802]  [] intel_crtc_set_config+0x7b2/0xa98 [i915]
[246289.614815]  [] drm_mode_set_config_internal+0x59/0xe5 
[drm]
[246289.614823]  [] drm_framebuffer_remove+0x8e/0x104 [drm]
[246289.614832]  [] drm_mode_rmfb+0xdc/0x101 [drm]
[246289.614839]  [] drm_ioctl+0x38a/0x429 [drm]
[246289.614847]  [] ? drm_mode_addfb2+0x30/0x30 [drm]
[246289.614849]  [] ? avc_has_perm+0x35/0x99
[246289.614852]  [] ? 
read_seqcount_begin.constprop.25+0x5a/0x70
[246289.614855]  [] do_vfs_ioctl+0x3ab/0x45e
[246289.614857]  [] ? file_has_perm+0x5d/0x81
[246289.614859]  [] ? __audit_syscall_entry+0xcd/0xef
[246289.614861]  [] SyS_ioctl+0x5a/0x7f
[246289.614864]  [] system_call_fastpath+0x16/0x1b
[246289.614865] ---[ end trace 238ba2a27a19ffce ]---

(The above stack trace came from a 3.17-rc7 based kernel.)

Looking at the code, we seem to be triggering on the following WARN_ON
in intel_display_power_put:

WARN_ON(!power_domains->domain_use_count[domain]);
power_domains->domain_use_count[domain]--;

This warning does *not* appear after a fresh reboot, logging in, and
then running xfce4-display-settings, so I assume that part of the
problem is that the internal kernel state is getting corrupted when
the laptop is undocked and the display disappears while the laptop is
suspend, and the internal state inconsistency / corruption persists
even though killing and restarting the X server seems to make things
the external display monitor become visible again.

This is consistent with the observation that sometimes the laptop will
lock up after either (a) resuming with the dock attached or (b) if the
external monitor is accidentally powered off (the power button on the
dell monitor is way too easy to accidentally hit), and the only way to
recover is with a magic sysrq reboot or a forced power cycle.

This is a not a regression, and it's been this way for several months
(including 3.17.0), but I've been too busy to really try to debug this
until now.

Hopefully the above is helpful; if there's anything more debugging
information I can get, let me know!!

- Ted


On Tue, Sep 02, 2014 at 12:05:27AM -0400, Theodore Ts'o wrote:
> 

[REGRESSION] i915: failure to see Dell 30" monitor connected to a Lenovo Haswell docking station

2014-09-25 Thread Theodore Ts'o
On Tue, Sep 02, 2014 at 02:15:39PM +1000, Dave Airlie wrote:
> On 2 September 2014 14:05, Theodore Ts'o  wrote:
> > I recently upgraded to v3.17-rc3, and on my Lenovo T540p, I can no
> > longer see the my Dell 30" monitor when it is connected via the
> > docking station using a Displayport connector.  This worked using 3.16
> > kernel.
> >
> > If I connect to the monitor using the mini-display, by passing the
> > docking station, things work fine (but of course it's annoying not to
> > be able to use the docking station).
> >
> > Is this a known problem?  This is not the first time that we've had
> > regressions with this docking station.   It's vaguely reminsicent of
> >
> > https://bugs.freedesktop.org/show_bug.cgi?id=71267
> >
> > Except the system isn't hanging; it's just not seeing the monitor at all.
> 
> Have you the Dell 30" set to Displayport 1.2 enabled mode?
> 
> If so, then see if disabling that in the monitor menus helps.
> 
> This is probably due to the fact we now attempt to talk to new DP devices
> with the protocol they provide. So previously the monitor exposed DP 1.2
> and we just didn't care, now if it exposes it we attempt to talk to it.

Hi Dave,

I've since upgraded to a newer X server, which may have been
responsible for the symptoms somewhat.  It now doesn't seem to matter
whether the Dell 30" monitor is set to DP 1.2 or not.  It now will
find the Dell 30" monitor reliably when the system is freshly booted,
attached to the Dock.  If I then suspend the laptop, remove it from
the dock, unsuspend it from the laptop, then resuspend the laptop, and
return it to the dock, it can no longer see the monitor until I
reboot.

I am currently running 3.17-rc4 based kernel, and I have the following
X server components:

ii  xserver-xorg1:7.7+7  amd64
X.Org X server
ii  xserver-xorg-core   2:1.16.0.901-1   amd64
Xorg X server - core server
ii  xserver-xorg-video-intel2:2.21.15-2+b2   amd64
X.Org X server -- Intel i8xx, i9xx display driver

Here is the dmesg file with drm.debug=6.  Could you take a quick look
and see if anything obvious jumps out at you?

Thanks,

- Ted

-- next part --
A non-text attachment was scrubbed...
Name: dmesg-repro.gz
Type: application/gzip
Size: 32179 bytes
Desc: not available
URL: 
<http://lists.freedesktop.org/archives/dri-devel/attachments/20140925/6255d4d4/attachment-0001.bin>


[REGRESSION] i915: failure to see Dell 30" monitor connected to a Lenovo Haswell docking station

2014-09-02 Thread Theodore Ts'o
On Tue, Sep 02, 2014 at 09:23:16PM +1000, Dave Airlie wrote:
> 
> Interesting, I have the same combo of hw available on my desk at work,
> but it might be a couple of days before I can get to the office to
> debug it,
> 
> can you boot with drm.debug=6 and get me the dmesg?

I'll do that when I get home.  In the meantime, here's an additional
data point.  At work, I have the same model docking station connected
to a 2011 Dell 2410f Rev A04 (max resolution 1920x1200, and I suspect
not DP 1.2 capable; at least, it doesn't mention DP in monitor menu)
--- and connecting through the docking station, it does work
(connecting through either DVI or DisplayPort).

Here's the drm.debug=6 connecting to the docking station via DVI.  I
can get a drm.debug=6 connecting via the DP and the docking station if
that would be helpful.  Similarly, if you want, I can also try to get
a debug run connecting to the HP ZRW30 monitor (either direct or via
the docking station), since that's the monitor on the walkstation.  :-)

Cheers,

- Ted

-- next part --
A non-text attachment was scrubbed...
Name: drm-debug.xz
Type: application/octet-stream
Size: 27396 bytes
Desc: not available
URL: 



[REGRESSION] i915: failure to see Dell 30" monitor connected to a Lenovo Haswell docking station

2014-09-02 Thread Theodore Ts'o
On Tue, Sep 02, 2014 at 02:15:39PM +1000, Dave Airlie wrote:
> On 2 September 2014 14:05, Theodore Ts'o  wrote:
> > I recently upgraded to v3.17-rc3, and on my Lenovo T540p, I can no
> > longer see the my Dell 30" monitor when it is connected via the
> > docking station using a Displayport connector.  This worked using 3.16
> > kernel.
> >
> > If I connect to the monitor using the mini-display, by passing the
> > docking station, things work fine (but of course it's annoying not to
> > be able to use the docking station).
> >
> > Is this a known problem?  This is not the first time that we've had
> > regressions with this docking station.   It's vaguely reminsicent of
> >
> > https://bugs.freedesktop.org/show_bug.cgi?id=71267
> >
> > Except the system isn't hanging; it's just not seeing the monitor at all.
> 
> Have you the Dell 30" set to Displayport 1.2 enabled mode?

No, it DP 1.2 was disabled.  If I enable it, it breaks things when I
try connecting via the MiniDP port (bypassing the dock), and it is
still broken when I try to talk via the DisplayPort in the dock.

If I disable DP 1.2 again, it works via the MiniDP port, but if I try
to connect through the dock (which has a DP Hub which I believe is
MST/DP 1.2 capable), it is still broken.

It does seem that this might be related to 3.17-rc3 trying to talk DP
1.2 if it is available, since I can't control what the DP hub in the
docking station advertises --- is there a commit or some kind of hack
I can try to force talking to the DP hub using DP 1.1?

- Ted


[REGRESSION] i915: failure to see Dell 30" monitor connected to a Lenovo Haswell docking station

2014-09-02 Thread Theodore Ts'o
I recently upgraded to v3.17-rc3, and on my Lenovo T540p, I can no
longer see the my Dell 30" monitor when it is connected via the
docking station using a Displayport connector.  This worked using 3.16
kernel.

If I connect to the monitor using the mini-display, by passing the
docking station, things work fine (but of course it's annoying not to
be able to use the docking station).

Is this a known problem?  This is not the first time that we've had
regressions with this docking station.   It's vaguely reminsicent of 

https://bugs.freedesktop.org/show_bug.cgi?id=71267

Except the system isn't hanging; it's just not seeing the monitor at all.

Thanks,

- Ted




[PATCH 0/6] File Sealing & memfd_create()

2014-04-10 Thread Theodore Ts'o
On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
> 
> This is the second time in a week that someone has asked for a way to
> have a struct file (or struct inode or whatever) that can't be reopened
> through /proc/pid/fd.  This should be quite easy to implement as a
> separate feature.

What I suggested on a different thread was to add the following new
file descriptor flags, to join FD_CLOEXEC, which would be maniuplated
using the F_GETFD and F_SETFD fcntl commands:

FD_NOPROCFS disallow being able to open the inode via /proc//fd

FD_NOPASSFD disallow being able to pass the fd via a unix domain socket

FD_LOCKFLAGSif this bit is set, disallow any further changes of FD_CLOEXEC,
FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.

Regardless of what else we might need to meet the use case for the
proposed File Sealing API, I think this is a useful feature that could
be used in many other contexts besides just the proposed
memfd_create() use case.

Cheers,

- Ted


[REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-11-03 Thread Theodore Ts'o
On Sat, Nov 03, 2012 at 11:21:58AM +0100, Daniel Vetter wrote:
> 
> Well, we know for sure that fdi link training is broken - it doesn't match
> at all what the spec says we should do. I've been working on this lately,
> since in quite a few circumstances the link train fails without the
> relevent bits indicating so. While testing I've also noticed that this
> entire thing is highly timing dependent, e.g. denpending upon which
> desktop is running and which tool I use to change the configuration it
> fails or succeeds.

So it is still (somewhat) broken in 3.7-rcX?  Certainly it seems to be
better (at least for me) than what was in 3.6.3.  Or are you saying I
may have just gotten lucky?  :-)

> So I have no suggestions for what could help your system and what should
> get backported, since the current code is still broken.

Thanks for your response and for your work in making the i915 driver
better.  I really appreciate your efforts.

Cheers,

- Ted


Re: [REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-11-03 Thread Theodore Ts'o
On Sat, Nov 03, 2012 at 11:21:58AM +0100, Daniel Vetter wrote:
 
 Well, we know for sure that fdi link training is broken - it doesn't match
 at all what the spec says we should do. I've been working on this lately,
 since in quite a few circumstances the link train fails without the
 relevent bits indicating so. While testing I've also noticed that this
 entire thing is highly timing dependent, e.g. denpending upon which
 desktop is running and which tool I use to change the configuration it
 fails or succeeds.

So it is still (somewhat) broken in 3.7-rcX?  Certainly it seems to be
better (at least for me) than what was in 3.6.3.  Or are you saying I
may have just gotten lucky?  :-)

 So I have no suggestions for what could help your system and what should
 get backported, since the current code is still broken.

Thanks for your response and for your work in making the i915 driver
better.  I really appreciate your efforts.

Cheers,

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-11-02 Thread Theodore Ts'o
Ping?

On Tue, Oct 30, 2012 at 04:32:21PM -0400, Theodore Ts'o wrote:
> On Tue, Oct 30, 2012 at 01:57:27PM +0200, Jani Nikula wrote:
> > > [1] drm-intel-next-queued branch at 
> > > git://people.freedesktop.org/~danvet/drm-intel
> > 
> > Hmm, actually not. Either drm-intel-fixes branch, or Linus' master.
> 
> Confirmed, the drm-intel-fixes branch from Daniel's tree
> (3.7.0-rc2-00031-g1623392) works fine for me.
> 
> Do you know which commit(s) are likely to have fixed the problem, so we
> can cherry pick the appropriate fix(es) to the 3.6.x tree?
> 
> Thanks,
> 
>   - Ted


[REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-10-30 Thread Theodore Ts'o
On Tue, Oct 30, 2012 at 01:57:27PM +0200, Jani Nikula wrote:
> > [1] drm-intel-next-queued branch at 
> > git://people.freedesktop.org/~danvet/drm-intel
> 
> Hmm, actually not. Either drm-intel-fixes branch, or Linus' master.

Confirmed, the drm-intel-fixes branch from Daniel's tree
(3.7.0-rc2-00031-g1623392) works fine for me.

Do you know which commit(s) are likely to have fixed the problem, so we
can cherry pick the appropriate fix(es) to the 3.6.x tree?

Thanks,

- Ted


[REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-10-30 Thread Theodore Ts'o

I recently upgraded to 3.6.3, and my Lenovo X230 has stopped being able
to work with an HP ZR30w 30 2560x1600 display.   I saw the following
messages in the dmesg:

[drm:ivb_manual_fdi_link_train] *ERROR* FDI train 1 fail!
[drm:ivb_manual_fdi_link_train] *ERROR* FDI train 2 fail!

.. which I didn't see before; the exact same mini-displayport to
displayport cable connecting the same Lenovo X230 laptop to exactly the
same ZRW 30 display worked just fine with the 3.6.0 kernel.

So I bisected the problem, and found the following commit.  Reverting
this commit made the problem go away.   Maybe we should revert
0c96c65b48fb in mainline?

Let me know if you'd like me to do any further debugging.

Thanks,

- Ted

commit 6c34ed3be47036c173f7f43df112f93fbd89026f
Author: Jani Nikula jani.nik...@intel.com
Date:   Wed Sep 26 18:43:10 2012 +0300

drm/i915: use adjusted_mode instead of mode for checking the 6bpc force flag

commit 0c96c65b48fba3ffe9822a554cbc0cd610765cd5 upstream.

The dithering introduced in

commit 3b5c78a35cf7511c15e09a9b0ffab290a42d9bcf
Author: Adam Jackson a...@redhat.com
Date:   Tue Dec 13 15:41:00 2011 -0800

drm/i915/dp: Dither down to 6bpc if it makes the mode fit

stores the INTEL_MODE_DP_FORCE_6BPC flag in the private_flags of the
adjusted mode, while i9xx_crtc_mode_set() and ironlake_crtc_mode_set() use
the original mode, without the flag, so it would never have any
effect. However, the BPC was clamped by VBT settings, making things work by
coincidence, until that part was removed in

commit 4344b813f105a19f793f1fd93ad775b784648b95
Author: Daniel Vetter daniel.vet...@ffwll.ch
Date:   Fri Aug 10 11:10:20 2012 +0200

Use adjusted_mode instead of mode when checking for
INTEL_MODE_DP_FORCE_6BPC to make the flag have effect.

v2: Don't forget to fix this in i9xx_crtc_mode_set() also, pointed out by
Daniel both before and after sending the first patch.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=47621
CC: Adam Jackson a...@redhat.com
Signed-off-by: Jani Nikula jani.nik...@intel.com
Reviewed-by: Adam Jackson a...@redhat.com
Signed-off-by: Daniel Vetter daniel.vet...@ffwll.ch
Signed-off-by: Greg Kroah-Hartman gre...@linuxfoundation.org

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-10-30 Thread Theodore Ts'o
On Tue, Oct 30, 2012 at 01:57:27PM +0200, Jani Nikula wrote:
  [1] drm-intel-next-queued branch at 
  git://people.freedesktop.org/~danvet/drm-intel
 
 Hmm, actually not. Either drm-intel-fixes branch, or Linus' master.

Confirmed, the drm-intel-fixes branch from Daniel's tree
(3.7.0-rc2-00031-g1623392) works fine for me.

Do you know which commit(s) are likely to have fixed the problem, so we
can cherry pick the appropriate fix(es) to the 3.6.x tree?

Thanks,

- Ted
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[REGRESSION] i915: failure to interoperate with HP ZR30w using an X230

2012-10-29 Thread Theodore Ts'o

I recently upgraded to 3.6.3, and my Lenovo X230 has stopped being able
to work with an HP ZR30w 30" 2560x1600 display.   I saw the following
messages in the dmesg:

[drm:ivb_manual_fdi_link_train] *ERROR* FDI train 1 fail!
[drm:ivb_manual_fdi_link_train] *ERROR* FDI train 2 fail!

.. which I didn't see before; the exact same mini-displayport to
displayport cable connecting the same Lenovo X230 laptop to exactly the
same ZRW 30 display worked just fine with the 3.6.0 kernel.

So I bisected the problem, and found the following commit.  Reverting
this commit made the problem go away.   Maybe we should revert
0c96c65b48fb in mainline?

Let me know if you'd like me to do any further debugging.

Thanks,

- Ted

commit 6c34ed3be47036c173f7f43df112f93fbd89026f
Author: Jani Nikula 
Date:   Wed Sep 26 18:43:10 2012 +0300

drm/i915: use adjusted_mode instead of mode for checking the 6bpc force flag

commit 0c96c65b48fba3ffe9822a554cbc0cd610765cd5 upstream.

The dithering introduced in

commit 3b5c78a35cf7511c15e09a9b0ffab290a42d9bcf
Author: Adam Jackson 
Date:   Tue Dec 13 15:41:00 2011 -0800

drm/i915/dp: Dither down to 6bpc if it makes the mode fit

stores the INTEL_MODE_DP_FORCE_6BPC flag in the private_flags of the
adjusted mode, while i9xx_crtc_mode_set() and ironlake_crtc_mode_set() use
the original mode, without the flag, so it would never have any
effect. However, the BPC was clamped by VBT settings, making things work by
coincidence, until that part was removed in

commit 4344b813f105a19f793f1fd93ad775b784648b95
Author: Daniel Vetter 
Date:   Fri Aug 10 11:10:20 2012 +0200

Use adjusted_mode instead of mode when checking for
INTEL_MODE_DP_FORCE_6BPC to make the flag have effect.

v2: Don't forget to fix this in i9xx_crtc_mode_set() also, pointed out by
Daniel both before and after sending the first patch.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=47621
CC: Adam Jackson 
Signed-off-by: Jani Nikula 
Reviewed-by: Adam Jackson 
Signed-off-by: Daniel Vetter 
Signed-off-by: Greg Kroah-Hartman 

- Ted